CAREER: Machine learning and signal processing methods for analyzing single-cell sequencing data

$499,991 | FY2020 | CSE | NSF

University of Connecticut, Storrs, CT

Abstract

High-throughput genome sequencing technologies, also known as next-generation sequencing technologies, have revolutionized genomics and medicine. This achievement owes much to computational methods that extract information from the massive numbers of small genome fragments these technologies produce. More recently, sequencing individual cells, known as single-cell sequencing, has shown significant advantages over conventional sequencing of a large population of cells, called bulk sequencing. Single-cell sequencing enables the discovery of new biological knowledge at the cellular level and a better understanding of the function of an individual cell, neither of which can be obtained via bulk sequencing. This emerging and fast-growing technology has attracted much interest and has had a major impact on several fields, such as microbiology, neurobiology, immunology, and developmental biology. With rapid advances in single-cell technologies, single-cell sequencing data and their applications continue to grow, while computational methods for analyzing these data lag behind. Compared to bulk sequencing, single-cell sequencing introduces new challenges in data analysis because of the low amount of DNA and RNA in a single cell and the extra steps that sequencing single cells requires. Moreover, thousands or millions of cells are sequenced in parallel in a given experiment, producing massive data sets to analyze. This project will build a strong computational foundation for analyzing single-cell sequencing data. The project outcome will contribute to advancing the biological sciences and improving human health by providing insight into critical biological unknowns that require single-cell resolution, such as the evolution of cancer cells and the development of stem cells. The project will also contribute to education and training in the high-demand, multidisciplinary field of bioinformatics and computational genomics.

Data analysis using advanced computational methods plays an essential role in extracting accurate and meaningful information from single-cell sequencing data. Current single-cell sequencing data analysis methods have been adapted from methods for bulk sequencing. These methods, however, are not designed to cope with the new challenges in single-cell sequencing data analysis, such as extensive noise, zero inflation and missing data, non-uniform genome coverage, data multimodality, and large data volumes. This project aims to address these challenges by developing novel computational methods and algorithms based on signal processing and machine learning techniques. The research focuses on identifying genomic variations in the form of copy number variations using single-cell DNA sequencing data, and on clustering cells using single-cell RNA sequencing data. The developed methods and algorithms will significantly advance knowledge in extracting accurate information from complex and massive single-cell sequencing data by (i) providing an optimal representation of genome coverage data by applying sparse optimization, (ii) modeling and reducing noise by employing denoising methods from signal processing, (iii) exploring information across cells by applying data-driven learning models, and (iv) incorporating prior knowledge by adapting network and word embedding models.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
