Biological Sequence Quality Control and Organization
National Library of Medicine
Abstract
During the past year, we improved our software package VADR (https://github.com/nawrockie/vadr) for viral sequence annotation using models based on RefSeq annotation. VADR aligns each complete input sequence to its nearest RefSeq sequence and uses that alignment to map the RefSeq annotation onto the input sequence. VADR is now being used by GenBank to help annotate Norovirus, Dengue virus, SARS-CoV-2 and metazoan cytochrome c oxidase subunit 1 (COX1) protein-coding gene sequences. Most of the development of VADR in the past year was aimed at facilitating and improving validation and annotation of SARS-CoV-2 sequences. In December 2020, we added a new reference model for the B.1.1.7 variant. In February 2021, we adapted the software to perform less stringent quality checks on ORF8 after observing a relatively large number of validated mutations in that protein-coding region (VADR v1.1.3). In April 2021, we added a new model for the B.1.525 variant and released v1.2, which accelerated sequence processing about 10-fold and reduced the memory requirement about 30-fold, to cope with the increasing size of sequence submissions from state public health labs and from the CDC. Finally, in August 2021, we released a new version (v1.3) that reduced the stringency of quality checks on ORF3a, ORF6, ORF7a, ORF7b and ORF10, so that additional sequences with no problems in more essential coding regions (e.g., the spike coding region) can pass VADR and be deposited into GenBank. Version 1.3 also reports positional information for every error, so that users can more easily investigate why any of their sequences failed VADR and were not deposited into GenBank.
In February 2021, we released a new version (v1.0) of the Ribovore software package, which is used for ribosomal RNA sequence analysis in various contexts at GenBank. We also submitted a paper on Ribovore to BMC Bioinformatics; it has been accepted but not yet published. To date, the ribosensor program, which is part of Ribovore, has been used to screen more than 50 million ribosomal RNA sequences submitted to GenBank.
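The alignment-based annotation step described above for VADR, in which reference annotation is mapped onto an input sequence through a pairwise alignment, can be illustrated with a minimal sketch. This is not VADR's code and does not reflect its internals: the function name map_feature, the gapped-string alignment representation and the toy sequences are all assumptions made for illustration only.

```python
from typing import Optional, Tuple


def map_feature(ref_aln: str, qry_aln: str,
                ref_start: int, ref_end: int) -> Optional[Tuple[int, int]]:
    """Map a 1-based inclusive reference interval onto query coordinates.

    ref_aln and qry_aln are the two rows of a pairwise alignment: strings of
    equal length that use '-' as the gap character (hypothetical input format).
    Returns None when no query residue is aligned to the interval.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("alignment rows must have the same length")

    ref_pos = 0   # number of reference residues seen so far
    qry_pos = 0   # number of query residues seen so far
    q_start = None
    q_end = None

    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            ref_pos += 1
        if q != "-":
            qry_pos += 1
        # Use only columns where both rows hold a residue and the
        # reference residue lies inside the annotated interval.
        if r != "-" and q != "-" and ref_start <= ref_pos <= ref_end:
            if q_start is None:
                q_start = qry_pos
            q_end = qry_pos

    if q_start is None:
        return None
    return (q_start, q_end)


# Toy alignment: the query lacks reference position 4 and has one
# inserted residue between reference positions 6 and 7.
REF_ALN = "ATGGCC-TAA"
QRY_ALN = "ATG-CCATAA"

mapped = map_feature(REF_ALN, QRY_ALN, 4, 7)
```

In this toy case the reference interval 4..7 maps to query positions 4..7: the deleted first residue moves the mapped start to the next aligned residue, and the inserted residue falls inside the mapped span. A real annotation tool also has to validate the mapped feature (for example, checking start and stop codons and reading frame), which this sketch does not attempt.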