Tracking Evolution and Spread of Viral Genomes by Geospatial Observation Error

$461,012R01FY2019AINIH

Arizona State University-Tempe Campus, Tempe AZ

Investigators

Matthew Scotchcontact Graciela Gonzalez Hernandez

Linked publications & trials

Paper 32798768 Paper 32683454 Paper 31278682 Paper 30864314 Paper 30838129 Paper 30412219 Paper 29950020 Paper 29240889 Paper 29026650 Paper 28950376 Paper 28815119 Paper 28405027 Paper 28322313

Abstract

? DESCRIPTION (provided by applicant): Tracking evolutionary changes in viral genomes and their spread often requires the use of data deposited in public databases such as GenBank, the Influenza Research Database (IRD), or the Virus Pathogen Resource (ViPR). GenBank provides an abundance of available viral sequence data for phylogeography. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic trees and models for surveillance. IRD and ViPR are NIH/NIAID funded programs that import data from GenBank but contain additional data sources, visualization, and search tools for their users. Tracking evolutionary changes and spread also requires the geospatial assignment of taxa, which is often obtained from GenBank metadata. Unfortunately, geospatial metadata such as host location is often uncertain in GenBank entries, with only 36% containing a precise location such as a county, town, or region within a state. For example, information such as China or USA was indicated instead of Beijing or Bedford, NH. While town or county might be included in the corresponding journal article, this valuable information is not available for immediate use unless it is extracted and then linked back to the appropriate sequence. The goal of our work is to enable health agencies and other researchers to automatically generate phylogeographic models that incorporate enhanced geospatial data for better estimates of virus spread. This proposal focuses on developing and applying information extraction and statistical phylogeography approaches to enhance models that track evolutionary changes in viral genomes and their spread. We propose a framework that uses natural language processing (NLP) for the automatic extraction of relevant geospatial data from the literature, and assigns a confidence between such geospatial mentions and the GenBank record. We will then use these locations and the estimates as observation error in the creation of phylogeographic models of zoonotic virus spread. We hypothesize that a combined NLP-phylogeography infrastructure that produces models that include observation error in the geospatial assignment of taxa will be closer to a gold standard than phylogeographic models that do not include them. Our research will extend phylogeography and zoonotic surveillance by: creating a NLP infrastructure that will improve the level of detail of geospatial data for phylogeography of zoonotic viruses (Aim 1), develop phylogeographic models using the estimates from Aim 1 as observation error (Aim 2), and evaluating our approach by comparing the models it produces to models that do not account for observation error in the geospatial assignment of taxa (Aim 3). We will allow users to generate enhanced models and view results on a web portal accessible via a LinkOut feature from GenBank, IRD, and ViPR. The addition of more precise geospatial information in building such models could enable health agencies to better target areas that represent the greatest public health risk.

View original record on NIH RePORTER →