Free Text Gene Name Recognition

$0 | Z01 | FY2005 | LM | NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 17238442 Paper 16843731 Paper 15960837 Paper 15290756 Paper 15130538 Paper 15073016 Paper 15016385 Paper 12798042 Paper 12463959 Paper 12176836

Abstract

We are currently pursuing two projects aimed at the problem of gene/protein name recognition.

1) We have produced a set of 20,000 sentences in which every occurrence of a gene/protein name is marked up with the word offsets of the name's beginning and end within the sentence. The sentences were drawn as random samples from restricted classes of MEDLINE abstracts: half were chosen as likely to contain gene/protein names and half as unlikely to contain them. Because marking names is ambiguous, alternative markings are listed as correct answers where this is thought to be appropriate. Three fourths of this set formed the basis for a task in the recent BioCreAtIvE (Critical Assessment of Information Extraction in Biology) Workshop held in Granada, Spain, this year. Twelve teams attempted to design systems that could correctly tag the gene/protein names in the sentences. Several teams obtained precision and recall in the low 80% range. A number of different approaches were successful, and these results suggest ways in which we can improve ABGene.

2) We have become convinced that more information about the different types of entities that occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We are experimenting with probabilistic context-free grammars and Markov models of text strings in an attempt to learn how to recognize the entities in different categories.
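To illustrate how markup with word offsets and alternative markings might be scored, here is a minimal sketch. It is not the evaluation code used at the workshop. The span representation, the `score` function, and the example sentence are all assumptions made for illustration: each annotated name is taken to be a set of acceptable (start, end) word-offset pairs, and a predicted span counts as correct if it exactly matches any one of them.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (first word offset, last word offset), inclusive.
Span = Tuple[int, int]


def score(predicted: Set[Span], gold: List[Set[Span]]) -> Tuple[float, float]:
    """Return (precision, recall) for one sentence.

    `gold` holds one set per annotated name; each set contains the primary
    marking plus any alternative markings accepted as correct.
    """
    matched = set()   # indices of gold names already credited
    true_pos = 0
    for span in sorted(predicted):
        for i, acceptable in enumerate(gold):
            if i not in matched and span in acceptable:
                matched.add(i)   # credit each gold name at most once
                true_pos += 1
                break
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Invented example: "The p53 tumor suppressor gene is mutated"
# word offsets:       0   1   2     3          4    5  6
gold_names = [{(1, 1), (1, 4)}]        # "p53" or "p53 tumor suppressor gene"
system_output = {(1, 4), (6, 6)}       # one correct span, one spurious span
precision, recall = score(system_output, gold_names)
```

On this invented sentence the system is credited for one of its two spans, and the single annotated name is found. A corpus-level figure would sum true positives, predictions, and gold names over all sentences before dividing.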
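One way a Markov model of text strings could be used to assign a name to a category is sketched below. This is an assumption-laden toy, not the SEMCAT implementation: the first-order character model, the add-one smoothing, the category labels, and the handful of training names are all chosen here for illustration only.

```python
import math
from typing import Dict, Iterable


class CharMarkovModel:
    """First-order character Markov model with add-one smoothing (illustrative)."""

    def __init__(self, alphabet_size: int = 256) -> None:
        self.counts: Dict[str, Dict[str, int]] = {}
        self.totals: Dict[str, int] = {}
        self.alphabet_size = alphabet_size  # assumed size of the symbol set

    def train(self, names: Iterable[str]) -> None:
        for name in names:
            padded = "^" + name.lower() + "$"   # start and end markers
            for prev, cur in zip(padded, padded[1:]):
                row = self.counts.setdefault(prev, {})
                row[cur] = row.get(cur, 0) + 1
                self.totals[prev] = self.totals.get(prev, 0) + 1

    def log_prob(self, name: str) -> float:
        padded = "^" + name.lower() + "$"
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            seen = self.counts.get(prev, {}).get(cur, 0)
            denom = self.totals.get(prev, 0) + self.alphabet_size
            total += math.log((seen + 1) / denom)
        return total


def classify(name: str, models: Dict[str, CharMarkovModel]) -> str:
    """Pick the category whose model gives the string the highest likelihood."""
    return max(models, key=lambda category: models[category].log_prob(name))


# Toy training data; category labels and names are hypothetical.
training = {
    "gene_or_protein": ["brca1", "tp53", "cdk2", "il6", "tnf"],
    "disease": ["diabetes", "hepatitis", "arthritis", "dermatitis"],
}
models: Dict[str, CharMarkovModel] = {}
for category, names in training.items():
    models[category] = CharMarkovModel()
    models[category].train(names)

label_a = classify("gastritis", models)  # "disease" on this toy data
label_b = classify("cdk4", models)       # "gene_or_protein" on this toy data
```

With only a few names per category the model mostly keys on suffixes such as "-itis" and on digit patterns. A realistic setup would train on the full name lists and would likely use a higher-order model.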
