Fast k-mer Counting to Quantify Gene Expression and Improve Genome Assembly

$240,627 | R21 | FY2012 | HG | NIH

Carnegie-Mellon University, Pittsburgh PA

Linked publications & trials

Paper 29990201 Paper 29641248 Paper 28263959 Paper 27821901 Paper 27267775 Paper 27153676 Paper 27059896 Paper 26854477 Paper 26463378 Paper 25649622 Paper 24868242 Paper 24752080 Paper 24089144 Paper 23990416 Paper 23812989

Abstract

DESCRIPTION (provided by applicant): We propose to investigate new computational approaches to two central problems of high-throughput sequence analysis: (1) quantification of transcript and species abundance in RNAseq and metagenomic data, and (2) improved error correction of sequencing reads. The proposed novel approaches to both of these problems derive from the ability to quickly count every instance of every k-mer (string of length k) within huge collections of sequence data. Extensive preliminary work on this problem, manifest in the k-mer counting software (called Jellyfish) published by the project personnel, will be brought to bear and extended. Existing mapping-based computational techniques for quantifying transcript abundance have found wide applicability, but read mapping is error-prone due to, e.g., splice junctions, microexons, and variation from the reference sequence. Aim 1 seeks to develop an alternative, mapping-free approach to transcript quantification from sequencing data that relies on clustering normalized k-mer count vectors to identify k-mers that are indicative of transcript or gene abundance. These k-mers form profiles that can be used to rapidly quantify expression of the given transcript or gene in subsequent experiments with limited computational effort while avoiding the challenging read mapping step. Aim 2 tackles the problem of error correction of genomic and, more speculatively, RNAseq reads by developing more accurate k-mer filtering methods and more compact de Bruijn graph representations. The new filtering procedures try to better distinguish correct from erroneous k-mers by simultaneously considering their position within the reads and the distribution of their quality scores across reads. Improved error correction and de Bruijn graph representations will be used to build more efficient algorithms for super-read and unitig creation, the initial stages of assembly.
The methods and software developed for both aims will make it significantly more feasible to complete high-throughput sequence analysis and assembly on widely available commodity computers.
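
To make the central operation concrete, the following is a minimal Python sketch of k-mer counting followed by a plain count-threshold filter. It is an illustration only, not Jellyfish's implementation and not the filtering method proposed in Aim 2 (which also considers k-mer position and quality scores). The function names count_kmers and solid_kmers, the choice to merge each k-mer with its reverse complement, and the min_count threshold are all assumptions made for this example.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes an ACGT alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count canonical k-mers across all reads.

    A k-mer and its reverse complement are counted together under the
    lexicographically smaller of the two (an assumption of this sketch).
    Windows containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip windows with ambiguous bases such as N
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy input: the third read differs from the first two by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
trusted = solid_kmers(counts, min_count=2)
```

In this toy input, the k-mers that overlap the single differing base each occur once and fall below the threshold, while k-mers shared by several reads are kept. A hash-based counter like this holds every distinct k-mer in memory, which is the cost that dedicated counting software aims to reduce on large data sets.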
