HMMER and Infernal: Finding distant homologs of sequences and RNA structures

$422,500 · R01 · FY2017 · HG · NIH

Harvard University, Cambridge MA

Investigators

Linked publications & trials

Paper 36511586 Paper 35588743 Paper 35255082 Paper 34751130 Paper 33253143 Paper 33211869 Paper 33137085 Paper 30357350 Paper 29905871 Paper 29788499 Paper 29112718

Abstract

Fast and sensitive sequence homology searches are fundamental tools in molecular biology. Our understanding of the human genome sequence depends in part on comparative sequence analysis of more experimentally accessible model organisms, and indeed on sequence comparisons across the tree of life. This proposal describes a plan to support two software packages for sequence homology search and alignment, HMMER and Infernal. HMMER is for protein and DNA sequence comparison, and it underlies many protein domain family databases and many genome sequence annotation procedures. Infernal is for RNA secondary structure/sequence comparison, and it is the foundation of various RNA structure/sequence analysis tools including the Rfam database of RNA families. Recent developments, including a new collaboration with the EMBL European Bioinformatics Institute to provide HMMER web servers, an upcoming HMMER4 release with new memory-efficient algorithms, and an expansion of the development teams to multiple universities and sites, suggest that beyond their current niches in genome analysis, both software packages are in a position to increase the scope and importance of their applications. To improve the foundation of software engineering in these packages, the proposal has three specific aims for improving speed, scaling, and support. The first aim focuses on speed improvements, especially in parallelization, both on typical desktop computers and on high performance computing resources. A measurable and important milestone of this aim is to make sequence homology searches run at interactive speeds (less than 1 second response time), the speed of a Google search, which could radically change the way biologists interact with sequence data. The second aim focuses on scaling improvements. Biological sequence data are growing exponentially, and we will make sure that the software can handle, and help biologists visualize, very large numbers of significant homologs, up to millions and more. The third aim focuses on improving support for the software, especially in improving our ability to engage a wider community of academic and industry developers who contribute to our codebases, and who use parts of our codebases in their own work.
