IIBR: Informatics: Toward an Automated RNA-seq Bioinformatician

$546,148 | FY2020 | BIO | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

Measurement of gene expression (which genes are active in which conditions) is an indispensable tool for understanding biological systems. Analysis of gene expression from modern genomic sequencing technologies requires sophisticated software such as read mappers, transcript assemblers, and expression abundance estimators. A program implementing one of these steps typically has a large number of user-settable parameters that influence how the analysis algorithm performs. Scientists, biologists, and clinical researchers must often tune these parameters by hand or through other ad hoc means. The goal of this project is to automate this process by designing and implementing a framework for automatically learning high-performing parameters for gene expression analysis software. This project also aims to develop algorithms, software, and methodology to make this framework practical and useful. This will allow more researchers to obtain high-quality gene expression analyses with significantly less effort and will also enable improved analysis of large data sets where per-sample parameter tuning by hand is impractical. Reproducibility of biological results will also be enhanced, since the choice of parameters is explicitly ceded to an automated, repeatable process. This research will make biological studies involving gene expression more accurate and less costly. A number of educational and outreach activities for students at various levels (elementary through undergraduate) are planned to enhance community understanding of gene expression and its analysis.

The developed processes will be implemented in several wrapper tools for parameter optimization that can be dropped into existing RNA-seq analysis pipelines to improve accuracy at each step. The research to design these tools will be broken down into several more tractable steps. The first step will be learning, for each tool, a collection of representative parameter vectors by analyzing large collections of existing RNA-seq samples. In the second step, machine learning methods, based on a combination of techniques such as Bayesian optimization, genetic algorithms, and classification approaches, will be used to design techniques that select, from these sets, the parameter vectors predicted to offer high performance. In the third step, techniques for providing human-interpretable rationales for the automatic parameter choices will be designed and implemented. The design of this system will also enhance our practical knowledge of techniques for such parameter optimization in other application domains within biology. Results from the project can be foun

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
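
The first two steps described in the abstract (learning a small set of representative parameter vectors, then choosing among them for a new sample) can be illustrated with a minimal Python sketch. This is not the project's method: the greedy coverage heuristic is a stand-in chosen for brevity, the exhaustive evaluation in `select_for_sample` stands in for the learned predictors the abstract mentions, and the parameter name `min_coverage`, the score matrix, and the quality function are all hypothetical toy values.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: one parameter vector is a dict of settings.
ParamVector = Dict[str, float]


def greedy_representative_set(scores: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k candidate indices that jointly serve the training samples well.

    scores[s][p] is an assumed quality measure (higher is better) of
    candidate parameter vector p on training sample s. Each round adds the
    candidate that most increases the summed per-sample best score.
    """
    n_samples = len(scores)
    n_candidates = len(scores[0])
    chosen: List[int] = []
    best_so_far = [float("-inf")] * n_samples
    for _ in range(min(k, n_candidates)):
        best_total, best_p = None, None
        for p in range(n_candidates):
            if p in chosen:
                continue
            # Total quality if p were added to the current set.
            total = sum(max(best_so_far[s], scores[s][p]) for s in range(n_samples))
            if best_total is None or total > best_total:
                best_total, best_p = total, p
        chosen.append(best_p)
        best_so_far = [max(best_so_far[s], scores[s][best_p]) for s in range(n_samples)]
    return chosen


def select_for_sample(
    evaluate: Callable[[ParamVector], float],
    candidates: Sequence[ParamVector],
    chosen: Sequence[int],
) -> int:
    """Return the index of the representative vector that scores best on a new sample.

    Every representative is evaluated directly; a learned predictor would
    replace this loop in a practical system.
    """
    return max(chosen, key=lambda p: evaluate(candidates[p]))


# Toy data (illustrative only): 4 training samples x 3 candidate vectors.
toy_scores = [
    [0.9, 0.2, 0.5],
    [0.8, 0.3, 0.5],
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.5],
]
toy_candidates: List[ParamVector] = [
    {"min_coverage": 1.0},
    {"min_coverage": 2.5},
    {"min_coverage": 10.0},
]

representatives = greedy_representative_set(toy_scores, k=2)


def toy_quality(params: ParamVector) -> float:
    # Hypothetical quality function for one new sample.
    return -abs(params["min_coverage"] - 3.0)


best_index = select_for_sample(toy_quality, toy_candidates, representatives)
```

In this toy case the candidate that is mediocre on every sample is left out of the representative set, because the two specialised candidates together cover all four samples better. Keeping the set small is what makes per-sample selection cheap enough to run inside a pipeline.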
