Free Text Gene Name Recognition

$0 | Z01 | FY2003 | LM | NIH

National Library of Medicine

Linked publications & trials

Paper 17238442, Paper 16843731, Paper 15960837, Paper 15290756, Paper 15130538, Paper 15073016, Paper 15016385, Paper 12798042, Paper 12463959, Paper 12176836

Abstract

We have begun developing a system to recognize gene or protein names in natural language text. The system currently consists of two modules. The first is a naive Bayes text classifier trained on more than 500,000 documents that contain known gene names. These documents are compared with the remainder of the text in PubMed, and the classifier learns the difference between the two sets. The second module is the Brill tagger, which we have modified to run on biological text. We have additionally taught the tagger to assign a GENE tag to gene names consisting of a single word, and several hundred additional rules have been learned for this purpose. Several processing steps are then applied as filters after the tagger to identify, among other things, multi-term gene names. We are currently evaluating how well the system recognizes gene names in a test set of text. We plan to continue work on this system and to incorporate new approaches into the basic system to improve it further. The system has been successful enough to attract several groups outside the NIH to use it for gene/protein name identification.
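The two modules described above can be illustrated with a minimal Python sketch. It is not the project's code: the function names (tokenize, train_naive_bayes, score, merge_gene_spans), the whitespace tokenizer, the add-one smoothing, and the rule of merging adjacent GENE-tagged tokens are all assumptions made for illustration, and the toy sentences stand in for PubMed text. The first part trains a small multinomial naive Bayes model on documents with known gene names versus other documents; the second shows one plausible post-tagging filter that joins neighboring GENE tags into a multi-term name.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def train_naive_bayes(gene_docs, other_docs):
    """Learn per-term log-odds (gene vs. other) with add-one smoothing."""
    gene_counts = Counter(t for d in gene_docs for t in tokenize(d))
    other_counts = Counter(t for d in other_docs for t in tokenize(d))
    vocab = set(gene_counts) | set(other_counts)
    gene_total = sum(gene_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)
    weights = {
        t: math.log((gene_counts[t] + 1) / gene_total)
        - math.log((other_counts[t] + 1) / other_total)
        for t in vocab
    }
    prior = math.log(len(gene_docs) / len(other_docs))
    return weights, prior


def score(text, weights, prior):
    """Log-odds that the text is gene-related; unseen terms are skipped."""
    return prior + sum(weights[t] for t in tokenize(text) if t in weights)


def merge_gene_spans(tagged):
    """Hypothetical filter: join runs of adjacent GENE tokens into one name."""
    names, current = [], []
    for token, tag in tagged:
        if tag == "GENE":
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Toy data standing in for the two document sets.
gene_docs = ["the brca1 gene encodes a protein", "mutation of the p53 gene"]
other_docs = ["patients received the drug daily", "the trial enrolled adult patients"]
weights, prior = train_naive_bayes(gene_docs, other_docs)

# Toy tagger output; the non-GENE tags are placeholders.
tagged = [("the", "DT"), ("tumor", "GENE"), ("necrosis", "GENE"),
          ("factor", "GENE"), ("binds", "VBZ"), ("p53", "GENE")]
```

A positive value from score(...) means the text looks more like the gene-name documents than the rest, and merge_gene_spans(tagged) returns the candidate names found in the tagged sentence. The actual filters applied after the tagger are not specified in this record, so the merge rule here should be read only as one possible example.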
