III: Small: RUI: Efficient Search, Comparison, and Annotation for Biological Sequences

$76,756FY2015CSENSF

East Texas A&M University, Commerce TX

Investigators

Abstract

This project aims to develop fast algorithms for searching, comparing, and annotating protein and RNA sequences. Regular expression matching is commonly used in UNIX-based systems for searching texts, and in PROSITE website for searching patterns in protein sequences. Context free grammar matching is used in searching RNA sequences. Annotating biological sequences (DNA, protein, and RNA sequences) using regular expression and context free grammar described motifs is an important application available in many public databases, websites, and software tools (e.g. PROSITE, Locomotif). The results of the proposed project will be helpful for programmers who develop pattern matching and parsing applications which would benefit from fast algorithms for searching and annotating sequences. An example of such an application outside bioinformatics is parsing with multiple context free grammars, which is used in natural language processing and program compiling. There will be strong student involvement during the entire project. As implementations become complete, students will help the PI present these implementations, and make them available for use in project web pages. This project involves fundamental computer science theory with applications in bioinformatics. It will yield new knowledge and case study results in automata and formal languages which are essential parts of the computer science curriculum. It will also help involved students master these topics. This project will develop algorithms for searching, comparing and annotating protein and RNA sequences by using (1) Seed-based matching: For finding an approximate match to a given sequence, popular alignment and search tool BLAST locates first an exact match of fixed length region (seed) and extends the matching region around the seed. This project generalizes the use of seeds in novel ways to pattern matching and annotation problems for regular expression and context free grammar described patterns. Initial results indicate that the proposed seed-based approach finds matches about 2.5 times faster than UNIX GREP utility on 2MB texts; (2) Suffix tree/array-based matching: For the annotation problem with bounded-length patterns over fixed alphabets, this project proposes using a suffix tree (or a suffix array) extended with additional information in order to identify from a candidate set, a regular expression or a context free grammar that generates a given string; and (3) A new representation for RNA: This project proposes a new RNA secondary structure representation in which two-dimensional structure information is embedded in the sequence with desirable features. For RNA sequences, the proposed new algorithms will exploit the advantages of this representation for fast RNA search, annotation, and comparison (of multiple RNAs to locate common substructures). Seed-based search and RNA comparison problems (using the new representation) will be addressed in the first year, and annotation problems in the second year as sequence annotation will make use of the results from the first year. The project will create experimental databases from publicly available databases such as PROSITE, Rfam, RNA STRAND, rCAD (in particular RNA sequences in .bpseq files) for the purpose of explaining and showing the results of the developed algorithms. Every semester, during the project, students will be involved in developing, implementing, testing new algorithms, user interfaces, and relevant support tools.

View original record on NSF Award Search →