Lustre data storage
University of Utah, Salt Lake City, UT
Investigators
Abstract
Modern advancements in sequencing technology now allow us to sequence human genomes in as little as a few hours. In addition, modern instruments can sequence dozens of samples simultaneously, and they can produce over 100 GB of data per sample. This deluge of sequencing data makes it difficult to store, process, and analyze the resulting files as fast as they can be produced. Even highly capable research groups often find themselves overwhelmed by the data management and computational hurdles associated with projects that may involve hundreds or even thousands of samples.

In general, the simplest solution to this type of large data problem is to divide the work of data processing across multiple machines in a high-performance computing (HPC) cluster, so that analyses run faster than they could on a single machine. Reading large amounts of data from traditional network-mounted storage systems, however, does not scale as one might assume. The maximum data throughput (read/write IO) of such a system can be reached with just 2-4 servers accessing the same filesystem. Given this bottleneck in data bandwidth, it becomes impossible to scale genomic analyses to the hundreds of additional CPUs and server nodes that may be available on an HPC cluster. As a result, most genomic analyses tend to be IO (read/write) limited rather than CPU limited, and many projects involving hundreds or even thousands of terabytes of sequencing data can take months or even years to fully analyze.

One solution to the data bandwidth limitation is to use specially designed high-performance filesystems such as Lustre to increase the read/write bandwidth on HPC clusters. Lustre is a load-balanced object storage system that can reach speeds in the hundreds of gigabytes per second, compared with just 0.5-5 GB per second for both local storage and most traditional network-mounted storage solutions. With Lustre storage, a dataset the size of an entire laptop hard disk can be read in just a few seconds, and large genomic analyses can easily scale across hundreds of machines and tens of thousands of CPUs without saturating read/write IO.

We propose to purchase a high-performance Lustre filesystem to support genomic analyses at the Utah Center for Genetic Discovery (UCGD) Core Facility at the University of Utah. High-performance storage systems are a critical piece of equipment for core facilities that focus on Next Generation Sequencing (NGS) analysis. This instrument will serve researchers across a wide range of medical disciplines, including human genetics, cardiology, hematology/oncology, diabetes, psychiatry, and gastroenterology.
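
The bandwidth figures quoted in the abstract can be turned into a rough back-of-envelope estimate. The Python sketch below is illustrative only: the per-sample size (100 GB) and the 0.5-5 GB per second range come from the abstract, while the 1,000-sample cohort, the 200 GB per second Lustre figure (one possible reading of "hundreds of gigabytes per second"), and the 512 GB laptop disk are assumptions chosen for the example. It models a single idealized sequential pass over the data and ignores CPU time, metadata overhead, and contention.

```python
def read_time_seconds(dataset_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to stream a dataset once at a sustained aggregate bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return dataset_gb / bandwidth_gb_per_s


# Figures quoted in the abstract.
GB_PER_SAMPLE = 100.0          # "over 100 GB of data per sample"
TRADITIONAL_LOW_GB_S = 0.5     # low end of local / network-mounted storage
TRADITIONAL_HIGH_GB_S = 5.0    # high end of local / network-mounted storage

# Assumptions made for this example only (not stated in the abstract).
SAMPLES = 1000                 # hypothetical cohort size
LUSTRE_GB_S = 200.0            # hypothetical value for "hundreds of GB per second"
LAPTOP_DISK_GB = 512.0         # hypothetical laptop disk size

cohort_gb = GB_PER_SAMPLE * SAMPLES

scenarios = [
    ("traditional storage, low end", TRADITIONAL_LOW_GB_S),
    ("traditional storage, high end", TRADITIONAL_HIGH_GB_S),
    ("Lustre (assumed)", LUSTRE_GB_S),
]

for label, bandwidth in scenarios:
    hours = read_time_seconds(cohort_gb, bandwidth) / 3600
    print(f"{label}: one pass over {cohort_gb:,.0f} GB takes about {hours:.1f} hours")

laptop_seconds = read_time_seconds(LAPTOP_DISK_GB, LUSTRE_GB_S)
print(f"Laptop-sized dataset on Lustre (assumed): about {laptop_seconds:.2f} seconds")
```

Under these assumptions, a single pass over the cohort takes roughly 56 hours at 0.5 GB per second, about 5.6 hours at 5 GB per second, and under 10 minutes at 200 GB per second. Real pipelines read and write the data several times, so actual gaps would likely be larger.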