High-performance Computing System for Bioinformatics

$461,402 · S10 · FY2009 · RR · NIH

Duke University, Durham NC

Abstract

DESCRIPTION (provided by applicant): The explosive growth of computational biology has made it difficult for research organizations to keep pace with users' demands for ever-increasing computational power. The complications that biologists face come from two developments. First, new technologies generate huge amounts of data. This makes it possible for biological investigations to broaden their scope to whole-genome, cellular, and even organism levels, but at the cost of overtaxing existing methods and resources for data analysis. Second, algorithms and methods of analysis have become more computationally intensive, in part as a response to the opportunities that data richness has brought about and in part to manage the unfortunate signal-to-noise ratio that seems implicit in genomic datasets. The emergence of "systems biology" has also led to growing complication in computational work, since systems biology seeks eventually to model biological phenomena in silico.

In effect, one of the major instruments for genome scientists and systems biologists, and indeed the most flexible, is the high-performance computer, because it is an essential tool for making sense of the prodigious amounts of data already coming from high-throughput sequencers, gene expression microarray equipment, mass spectrometers, and the like.

On high-performance cluster computers, many researchers make use of basic "job-level parallelism," by which a single user may run multiple jobs (or independent sub-parts of jobs) on many hundreds of computers at once. Often this takes the form of computational "parameter space studies," in which the same application is run on tens, hundreds, or thousands of different sets of inputs. Simulating the evolution of regulatory regions, for example, requires multiple runs in which the size and number of short regulatory motifs are tuned. The prediction of gene regulatory networks requires multiple simulations in which different cell types and different tissue regions are modified. Simulations of gene expression dynamics in populations of cells must also be run multiple times in order to account for "cellular noise" and get a comprehensive picture of the phenomena. This need for repeated computations makes cluster computing an attractive approach for these problems.

Our proposal requests 94 power-efficient compute servers and about 8 terabytes (usable) of high-speed data storage with matched disaster recovery storage. This equipment will be put into operation using Sun Grid Engine, a software application that coordinates computational resources so that individual machines function as one clustered computational instrument. Bioinformatic software tools, as well as custom-made applications, are available for researchers to use on the equipment.

PUBLIC HEALTH RELEVANCE: Next-generation instruments have made acquiring genomic data inexpensive and ever more efficient, and new technologies promise to add greatly to the resolution and richness of data used for biomedical research and for translational medicine. This torrent of data needs equally powerful and flexible tools for analysis and information creation, in effect matching high-throughput data producers with high-performance computational tools for analysis. We propose the creation of a well-integrated computational system that matches in compute power the prodigious data flows from instruments producing genomic data.
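
The job-level parallelism described in the abstract can be illustrated with a minimal Python sketch of a parameter space study run as a Grid Engine array job. It is not taken from the proposal: the parameter values, the function names, and the toy "simulation" are all hypothetical placeholders. The only scheduler-specific assumption is that Grid Engine exposes the index of each array task in the `SGE_TASK_ID` environment variable.

```python
import itertools
import os
import random

# Hypothetical parameter grid; the values are illustrative only.
MOTIF_SIZES = [6, 8, 10]        # length of each short regulatory motif
MOTIF_COUNTS = [1, 2, 4, 8]     # number of motifs per regulatory region
REPLICATES = 5                  # repeated runs per setting, to average over noise


def build_grid():
    """Return every (motif_size, motif_count, replicate) combination."""
    return list(itertools.product(MOTIF_SIZES, MOTIF_COUNTS, range(REPLICATES)))


def params_for_task(task_id, grid):
    """Map a 1-based array task ID to exactly one parameter set."""
    if not 1 <= task_id <= len(grid):
        raise ValueError(f"task_id {task_id} is outside 1..{len(grid)}")
    return grid[task_id - 1]


def run_simulation(motif_size, motif_count, replicate):
    """Placeholder for a real simulation: returns a noisy toy score.

    The seed is derived from the parameters so each task is reproducible.
    """
    seed = motif_size * 10_000 + motif_count * 100 + replicate
    rng = random.Random(seed)
    return motif_size * motif_count + rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    grid = build_grid()
    # Outside an array job the variable may be missing or non-numeric,
    # so fall back to the first task in that case.
    raw = os.environ.get("SGE_TASK_ID", "1")
    task_id = int(raw) if raw.isdigit() else 1
    size, count, rep = params_for_task(task_id, grid)
    score = run_simulation(size, count, rep)
    print(f"task={task_id} size={size} count={count} rep={rep} score={score:.3f}")
```

Under these assumptions, the whole study would be submitted once as an array job covering tasks 1 through 60 (for example with `qsub -t 1-60` and a hypothetical wrapper script), and the scheduler would spread the independent tasks across whatever compute servers are free. Because no task depends on another, the same script scales from one machine to the full cluster without changes.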
