HIGH THROUGHPUT ANNOTATION OF GENOMIC DNA SEQUENCE

$566,554 | R01 | FY2002 | HG | NIH

University Of Pennsylvania, Philadelphia PA

Investigators

Linked publications & trials

Paper 16725306 Paper 15833120 Paper 15608230 Paper 15145818 Paper 14990441 Paper 14708120 Paper 12762848 Paper 11932249 Paper 11591652

Abstract

DESCRIPTION (Applicant's abstract): Now that a working draft sequence of the human genome is in hand and an ongoing effort is in place to provide a draft of the mouse genome, the challenge is to identify the genes encoded by these genomes. Several efforts are underway in this regard, including our own using ab initio gene finders and transcribed sequences in the form of mRNAs and ESTs. Gene prediction is the first step in identifying genes. Additional steps are to predict the function of those genes and to associate other information, such as where (and when) the gene might be expressed. The goal of the proposed project is to build a public database that serves as a central repository of gene predictions and associated annotation. The project will provide data integration such that predictions and annotations for the same gene (as defined by co-localizing to the same genomic location) will be linked. Associated annotation will be extended to include functional predictions and expression profiles. The intended users of the database are researchers seeking to extend their knowledge of a gene starting with an expression profile, a cDNA, or a genetic locus, or to search generally for candidate genes. The prototype annotation framework for genomic sequence, GAIA, has been combined with prototypes for a gene index of ESTs and mRNAs, DoTS, and gene integration, EpoDB. The result is a database based on a global schema, GUS, that integrates sequence-centered entries from GenBank, dbEST, and SWISS-PROT and transforms the entries into gene-centered entities. This process includes data cleansing and adding value through annotation of the resultant genes (mRNAs and proteins). A first pass of this resource is on-line with ad hoc boolean queries and integrated visual tools at www.allgenes.org. The resource will provide an integrated set of known and predicted genes from GenBank, gene finders, and assembled ESTs and mRNAs.
Ontologies will be used to structure the annotations of biological concepts and gene function. Gene expression information will be augmented with RAD (RNA Abundance Database). No other public resource of this nature currently exists. Data currency of this resource will be maintained through periodic updates every 2-3 months. The updates will include integration of previously annotated genes with newly available GenBank and dbEST entries and recalculation of gene similarities, gene location, tissue distribution, and gene function. An annotation interface has been developed to complement and extend computational analysis through manual assessment of predictions for genes and their functions. Radiation hybrid mapping data for mouse sequences will be incorporated, as has been done for human ESTs. Links between the genes in GUS and gene expression data in RAD will be established. In response to requests from users of the allgenes.org site, additional queries will be added to the web interface and bulk files will be provided. On-demand annotation of new contigs is also planned.
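
The abstract states that predictions and annotations for the same gene will be linked when they co-localize to the same genomic location, but it does not give the exact criterion. As a minimal sketch only, the Python below assumes that "co-localizing" means overlapping intervals on the same chromosome and strand, with 1-based inclusive coordinates. The `Prediction` record, its field names, and the `link_by_location` function are hypothetical illustrations and are not part of GUS, DoTS, or allgenes.org.

```python
from collections import namedtuple

# Hypothetical record for one gene prediction or aligned transcript.
# Coordinates are assumed to be 1-based and inclusive.
Prediction = namedtuple("Prediction", "source chrom strand start end")


def link_by_location(predictions):
    """Group predictions that overlap on the same chromosome and strand.

    Assumption: overlap of intervals stands in for "co-localizing to the
    same genomic location"; the real criterion may differ.
    """
    ordered = sorted(predictions, key=lambda p: (p.chrom, p.strand, p.start, p.end))
    groups = []
    current, current_end = [], None
    for p in ordered:
        same_locus = (
            current
            and current[-1].chrom == p.chrom
            and current[-1].strand == p.strand
            and p.start <= current_end
        )
        if same_locus:
            # Extend the running locus to cover the new prediction.
            current.append(p)
            current_end = max(current_end, p.end)
        else:
            if current:
                groups.append(current)
            current, current_end = [p], p.end
    if current:
        groups.append(current)
    return groups


# Illustrative input only; these are not real annotations.
example = [
    Prediction("gene_finder", "chr1", "+", 100, 500),
    Prediction("est_assembly", "chr1", "+", 450, 900),
    Prediction("mrna", "chr1", "-", 450, 900),
    Prediction("gene_finder", "chr2", "+", 100, 200),
]
linked = link_by_location(example)
```

In this sketch the two plus-strand predictions on chr1 fall into one group because their intervals overlap, while the minus-strand and chr2 entries remain separate. A production system would likely also need to handle spliced alignments and partial overlaps, which this example ignores.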
