Natural Language Processing Techniques To Enhance Information Access.

$539,651 | ZIA | FY2011 | LM | NIH

National Library Of Medicine

Abstract

Recently we have been involved in four subprojects that use natural language processing techniques:

1) We have developed a machine learning algorithm for identifying abbreviation definitions in text that makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly pairing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples, and the learned feature weights are then used to identify the full form of an abbreviation. This approach does not require manually labeled training data. We evaluated the algorithm on the Ab3P, BIOADI and Medstract corpora. We achieve an F-measure of 91.36% on the Ab3P corpus and 87.13% on the BIOADI corpus, both superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.

2) We are studying paraphrases in MEDLINE abstracts. These arise when an author describes some entity of interest with a phrase such as "drug abuse" and then, needing to refer to the same entity a sentence or two later, avoids repeating the exact wording and uses a variant such as "drug use", which in the context of "drug abuse" has substantially the same meaning.

3) We have developed an author disambiguation algorithm that relies on machine learning, based on the assumption that if an author name is infrequent in the data, it probably represents the same person in all documents where it is found. This gives us positive instances. Negative instances are sampled from pairs of documents that have no author in common. Such positive and negative data allow us to apply machine learning to all aspects of the document other than the name in question, and so to learn how to weight these aspects for best performance in distinguishing positive from negative instances. This learning is then applied in individual name cases or spaces to determine which author-document pairs represent the same author.
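The naturally labeled data idea in subproject 1 can be illustrated with a minimal sketch. This is not the project's actual code: the parenthesis pattern, the word-window heuristic, and the names `candidate_pairs` and `build_training_set` are all assumptions made for illustration. The sketch only shows how positives (pairs as they occur in text) and negatives (short forms randomly paired with candidate definitions from other pairs) might be assembled without manual labels.

```python
import random
import re

# Hypothetical pattern: a parenthesized token of 2 to 10 characters.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

def candidate_pairs(sentence, max_words=5):
    """Return (short_form, candidate_definition) pairs found in one sentence."""
    pairs = []
    for match in PAREN.finditer(sentence):
        short_form = match.group(1)
        words = sentence[:match.start()].split()
        # Assumption: the definition lies in the few words before the parenthesis.
        window = words[-min(max_words, len(short_form) + 2):]
        if window:
            pairs.append((short_form, " ".join(window)))
    return pairs

def build_training_set(sentences, seed=0):
    """Label naturally occurring pairs 1 and randomly mismatched pairs 0."""
    rng = random.Random(seed)
    positives = [p for s in sentences for p in candidate_pairs(s)]
    negatives = []
    for short_form, definition in positives:
        # Candidate definitions that came from a different pair.
        others = [d for sf, d in positives if sf != short_form and d != definition]
        if others:
            negatives.append((short_form, rng.choice(others)))
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]
```

A classifier trained on such examples would still need features computed from each pair; the abstract does not specify them, so they are left out here.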
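The construction of training instances for the author disambiguation subproject can be sketched in the same spirit. Everything here is an illustrative assumption rather than the project's implementation: the function name `build_author_pairs`, the `max_name_count` threshold used to decide that a name is infrequent, and the representation of documents as sets of author name strings. Positive pairs are documents that share an infrequent name; negative pairs are sampled from documents with no author in common.

```python
import itertools
import random
from collections import defaultdict

def build_author_pairs(docs, max_name_count=2, seed=0):
    """docs maps a document id to its set of author name strings."""
    docs_by_name = defaultdict(list)
    for doc_id, authors in docs.items():
        for name in authors:
            docs_by_name[name].append(doc_id)

    # Positive instances: document pairs sharing an infrequent author name.
    positives = set()
    for name, ids in docs_by_name.items():
        if 2 <= len(ids) <= max_name_count:
            for a, b in itertools.combinations(sorted(ids), 2):
                positives.add((a, b))

    # Negative instances: sampled from document pairs with no author in common.
    disjoint = [(a, b) for a, b in itertools.combinations(sorted(docs), 2)
                if not docs[a] & docs[b]]
    rng = random.Random(seed)
    negatives = rng.sample(disjoint, min(len(positives), len(disjoint)))
    return sorted(positives), sorted(negatives)
```

A learner would then be trained on features of each document pair other than the name itself (the abstract does not list them), and the resulting weights applied within a single name space.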
