Bilateral BBSRC-NSF/BIO: ABI Innovation: Data-driven hierarchical analysis of de novo transcriptomes
SUNY at Stony Brook, Stony Brook NY
Abstract
This project will determine what methods are needed when researchers study how genes are expressed in organisms that do not have a completed DNA sequence of the genome. Genes provide the potential for functions an organism can carry out under the right conditions, and genes are expressed to make sure cells can perform the functions they need to live in their environment. Each gene can have a family of expressed forms, and their relationships can be very difficult to sort out without the sequence of the gene itself for comparison, but that sequence is missing for most organisms. The view of gene expression obtained from such experiments is typically fractured, incomplete, and difficult to analyze; this is the central problem this research addresses. Once reliable analysis methods are worked out, this research will produce software tools that use the methods and are designed to take common errors and missing parts of the data into account. As a check, the reconstructed gene sequences will be compared to known genes in related organisms, and strong relationships will be used to guide predictions of the genes' functions. All predictions will carry with them a number indicating the uncertainty of the information. To ground the research in a biologically interesting question, the methods and software will be used to study plants that use an unusual form of energy conversion, C4 photosynthesis, and compare it to the more common form used by plants in order to investigate the different genetic mechanisms, including regulation, used by each type of plant. Predictions will be tested with wet-lab experiments, and analysis methods will then be improved as needed. An online community that shares the interests of the researchers will be created and fostered by providing expert advice for carrying out and analyzing similar experiments.
This project will develop a novel collection of methods, and an integrated set of tools, for the analysis of de novo transcriptomes. There are currently a number of tools that aim to tackle different phases of the de novo transcriptome analysis pipeline (e.g., assembly, clustering, expression quantification, and differential expression testing), but none of these provides a well-integrated, principled, and efficient approach to this difficult challenge. The methods developed and validated in this project will provide a state-of-the-art pipeline for posing and answering a host of relevant biological questions about how transcripts, genes, and functional modules are differentially expressed and regulated, specifically in the context of organisms for which a reference genome is lacking. The project will result in the development of novel methods for the data-driven clustering of contigs in de novo assemblies. Clustering will be decided on the basis of sequence and expression-level similarities between contigs, accounting for known hallmarks of mis-assemblies to determine contigs that arise from the same underlying transcript. Transcripts will be grouped by predicted, shared exonic structure to discover sub-genic features, genes, and gene families. The result of this process will be a hierarchical model of the transcriptome. New methods for efficient quantification and differential expression analysis of these hierarchical models will be developed, including a methodologically sound approach for propagating measures of quantification uncertainty into downstream analysis. Finally, these tools will be validated via a large-scale reanalysis of existing de novo transcriptome data targeted at elucidating the identity of genetic regulatory elements involved in C4 photosynthesis.
This reanalysis will harness the increased accuracy and efficiency of the methods developed to analyze data from all previous non-model C4 RNA-seq experiments in a single, multi-scale model. The output of the model will be a system for quantitatively prioritizing candidate regulatory elements for detailed molecular investigation. This project also includes broader impact goals that will provide research opportunities to undergraduate students and will create an actively maintained and inclusive online community centered around best practices in de novo transcriptome analysis and experimental design. For further information regarding progress on this project, including the relevant software being developed, please visit https://combine-lab.github.io/txome.
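As a rough illustration of the contig clustering idea described in the abstract (grouping contigs by both sequence and expression-level similarity, then summarizing expression at the group level), the following minimal sketch may help. It is not the project's method: k-mer Jaccard similarity and Pearson correlation are stand-ins chosen for simplicity, and the function names, thresholds, and toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def kmers(seq, k=5):
    """Return the set of k-mers in a sequence (assumed stand-in for sequence similarity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def cluster_contigs(seqs, expr, k=5, min_seq=0.3, min_corr=0.8):
    """Link contigs that pass BOTH (hypothetical) thresholds, then return connected groups."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kmer_sets = {name: kmers(s, k) for name, s in seqs.items()}
    for a, b in combinations(seqs, 2):
        similar_seq = jaccard(kmer_sets[a], kmer_sets[b]) >= min_seq
        similar_expr = pearson(expr[a], expr[b]) >= min_corr
        if similar_seq and similar_expr:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


def cluster_expression(clusters, expr):
    """Sum per-sample expression over the contigs in each cluster (one level of a hierarchy)."""
    return [[sum(vals) for vals in zip(*(expr[n] for n in members))]
            for members in clusters]


# Toy data (invented for illustration): three contigs, three samples.
seqs = {
    "c1": "ACGTACGGTCAATGC",
    "c2": "ACGTACGGTCAATGA",
    "c3": "TTTTGGGGCCCCAAAA",
}
expr = {"c1": [10, 20, 30], "c2": [5, 11, 14], "c3": [30, 20, 10]}

clusters = cluster_contigs(seqs, expr)
totals = cluster_expression(clusters, expr)
print(clusters, totals)
```

Requiring both criteria before linking two contigs reflects the abstract's point that clustering draws on sequence and expression evidence together; a real pipeline would also need to account for mis-assembly hallmarks and quantification uncertainty, which this sketch omits.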