A computational approach to identify non-linear sequence similarity between lncRNAs

$662,630 | FY2023 | BIO | NSF

University of North Carolina at Chapel Hill, Chapel Hill, NC

Abstract

The goal of this proposal is to develop a computational approach to identify non-linear sequence similarity between long noncoding RNAs (lncRNAs). A substantial portion of the RNAs produced by eukaryotic genomes can be classified as lncRNAs, which have little or no potential to encode protein. LncRNAs are essential across kingdoms of life owing to their critical roles in gene regulation. However, progress in the field has been stifled by a lack of computational tools to identify meaningful similarity between lncRNAs. Unlike protein-coding genes, lncRNAs are not constrained by codon usage, evolve rapidly, and achieve function by employing structures or proteins in ways that are not well understood. Thus, lncRNAs with similar functions often lack any semblance of linear sequence similarity, yet for lack of other options, linear alignment remains the predominant approach for sequence comparison in the lncRNA field. As a result, studies of one lncRNA rarely inform the understanding of others, and among the thousands of unstudied lncRNAs, it is nearly impossible to computationally identify those that encode meaningful functions. However, prior research has demonstrated, as a proof of principle, that non-linear forms of sequence comparison can provide substantially more information than linear alignment about the biological properties of lncRNAs, including a modest ability to infer molecular function. In this project, researchers will develop and validate software that will enable any biologist, regardless of computational expertise, to perform quantitative, non-linear sequence comparisons on essentially any computing resource, including a personal laptop.
In concert with the development and validation of the new software, the project will provide high-quality mentored research experiences and sustained career guidance for undergraduate students from underrepresented or underprivileged backgrounds, thereby encouraging their entry into science and promoting equity, diversity, and overall excellence in computational biology. The investigative team recently developed an approach called SEEKR (sequence evaluation through k-mer representation), which compares sequences by their relative abundance of substrings called k-mers. SEEKR provided some of the first evidence that lncRNAs with analogous functions can harbor similarities that are invisible to conventional forms of linear sequence alignment. Despite this success, SEEKR remains limited in its utility. Most notably, it is unable to identify regional similarity between lncRNAs and has no means to consider local nucleotide context in similarity evaluations, each of which is an essential component of lncRNA functionality. Moreover, SEEKR is qualitative and gives end-users no ability to assess the significance of its similarity scores, a critical component of all broadly used sequence comparison tools. Thus, while SEEKR was an important proof of principle, it falls well short of the reliable and broadly applicable tool that the field needs to identify meaningful non-linear similarity between lncRNAs. To address these shortcomings and provide biologists with better tools to identify relationships between sequence and function in lncRNAs, this research will apply a statistical approach called the hidden Markov model (HMM) to develop a Python-based software package, hmmSEEKR, that will give biologists the ability to identify regional and whole-transcript similarities in k-mer content among any set of lncRNAs.
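The idea of comparing sequences by relative k-mer abundance, rather than by linear alignment, can be illustrated with a minimal sketch. This is not the SEEKR or hmmSEEKR implementation: the function names, the length normalization, the choice of k = 3, and the use of Pearson correlation as the similarity score are all assumptions made for illustration, and the toy sequences are invented.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return k-mer frequencies over the ACGT alphabet (hypothetical helper).

    Counts are divided by the total number of counted k-mers so that
    sequences of different lengths can be compared.
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy sequences (invented): seq_b contains the same two blocks as seq_a in
# the opposite order, so the two share k-mer content but not linear order.
seq_a = "ACGUACGUACGU" + "GGGGGGGG"
seq_b = "GGGGGGGG" + "ACGUACGUACGU"
seq_c = "CCCCCCCCCCAAAAAAAAAA"  # unrelated k-mer content

sim_ab = pearson(kmer_profile(seq_a), kmer_profile(seq_b))
sim_ac = pearson(kmer_profile(seq_a), kmer_profile(seq_c))
print(f"a vs b: {sim_ab:.3f}   a vs c: {sim_ac:.3f}")
```

In this toy case the rearranged pair scores much higher than the unrelated pair even though the shared blocks sit at different positions. A whole-transcript score of this kind says nothing about where within a transcript the similarity lies, which is the limitation the HMM-based approach described above is meant to address.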
An early version of hmmSEEKR enabled the identification of known protein-binding domains and functionally characterized lncRNAs from within the mammalian transcriptome, feats that, to the knowledge of the investigative team, have not previously been achieved, including with SEEKR. hmmSEEKR will be rigorously validated and tuned by identifying commonalities in protein-binding profiles between a set of known lncRNA functional domains and other lncRNA-like domains from within the transcriptome. Findings, including recommendations for default parameters, will be published in open access journals, and a vetted version of hmmSEEKR will be deposited on GitHub and the Python Package Index. A usage manual and links to results will also be posted on https://www.med.unc.edu/pharm/calabreselab/seekr/. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
