Statistical phrase extraction techniques in natural language databases.
National Library of Medicine
Abstract
The ability to locate important phrases in natural language text is useful for indexing and for placing hyperlinks in text; in either case the goal is to improve access to the textual material. In the past, the most common method for locating phrases has been a part-of-speech tagger. We have developed a new approach that uses a number of scoring algorithms to rank phrases by how useful they are likely to be. Eight different methods have been developed and tested. They have proved effective at ranking known phrases from the Unified Medical Language System (UMLS), developed by the National Library of Medicine, high among all the phrases obtained from subsets of the Medline document collection. Six of the methods have been combined to produce optimal scoring methods, which have proven useful in extracting material of quality similar to that already in the UMLS. They also appear promising as a way to mark text with hyperlinks for navigation purposes. Two papers are being published on this topic, and the methods are being applied to the electronic textbook project at NCBI.
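The abstract does not say which eight scoring methods were used, so the following is only an illustrative sketch of the general idea: score candidate phrases statistically, rank them, and check how high a list of known phrases lands in the ranking. It assumes pointwise mutual information (PMI) over adjacent word pairs as a stand-in score, and the function names `rank_bigrams` and `mean_rank_fraction`, the `min_count` threshold, and the toy sentences are all hypothetical and not taken from the project.

```python
import math
from collections import Counter


def rank_bigrams(sentences, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    PMI is an assumed stand-in score, not one of the project's methods.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # skip pairs too rare to score reliably
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        scored.append((pmi, f"{w1} {w2}"))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored]


def mean_rank_fraction(ranked, known_phrases):
    """Average relative position (0.0 = top) of known phrases in the ranking.

    Returns None if no known phrase appears among the ranked candidates.
    """
    positions = [
        index / len(ranked)
        for index, phrase in enumerate(ranked)
        if phrase in known_phrases
    ]
    return sum(positions) / len(positions) if positions else None


if __name__ == "__main__":
    # Toy corpus and toy "known phrase" list, both invented for illustration.
    corpus = [
        "the heart attack was severe",
        "the heart attack was treated",
        "the patient was treated",
        "the patient was stable",
    ]
    ranking = rank_bigrams(corpus)
    print(ranking)
    print(mean_rank_fraction(ranking, {"heart attack"}))
```

A lower value from `mean_rank_fraction` means the known phrases sit nearer the top of the ranking, which mirrors the kind of evaluation described above, where known UMLS phrases are expected to rank high among all candidate phrases. Combining several scores, as the abstract reports for six of the methods, would need a further step that is not sketched here.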