Development of Data Science Methods

$994,428 · ZIA · FY2021 · ES · NIH

National Institute Of Environmental Health Sciences

Abstract

In my role as Director of the Office of Data Science (ODS), I support a programmatic effort to help NIEHS and environmental health researchers acquire, collect, store, represent, analyze, and disseminate computerized research data, information, and knowledge. Much of the work ODS conducts is service-based, ensuring that researchers have support in working with research data. To fully support researchers' goals, I work with ODS staff, collaborators, and researchers to identify, develop, and translate new data science methods and approaches into research practice.

A challenge in environmental health research is that information on exposures is difficult to capture at the individual level. As a result, researchers often rely on survey instruments that ask study participants to report exposures at a level that is familiar to participants but difficult to use in scientific research. For example, participants may be asked about the hobbies they participate in and the foods they eat, while researchers are interested in the specific hobby-related chemicals and food-related nutrients. Participant-level exposures are often related to each other, including by chemical constituents, routes of exposure, and the subpopulations that receive the exposure. These relationships are often known in a general sense but are not represented in a computerized format that can facilitate across-exposure analysis. To improve the utility of exposure survey instruments, I began collaborating this past year with Lauren Chan, Dr. Anne Thessen, and Dr. Melissa Haendel at Oregon State and Shepherd Schurmann at NIEHS to investigate ways to link the survey questions to biomedical, chemical, and food ontologies. This linkage will provide a knowledge framework to support interpretation and analysis of exposure surveys. Lauren Chan has accepted an NIEHS summer internship under my direction to focus on this work in 2021.
Our work focuses on three survey instruments that are used in the NIEHS Environmental Polymorphism Registry (EPR) study. The EPR surveys are based on surveys used in the large-scale NHANES and National Sister Study, so success in this effort will support research beyond the current EPR focus.

A significant challenge in environmental health research is extracting and aggregating known information about exposures that is recorded within research journals in free text, figures, and images. I am collaborating with program staff within the Division of the National Toxicology Program (DNTP) and at the Oak Ridge National Laboratory to advance the development and application of methods to extract information from unstructured text and tables using natural language processing (NLP) and image analysis techniques. Our efforts are advancing techniques in three main areas: 1) extracting tabular information from tables within papers, 2) developing classification models that determine whether papers meet certain research-quality metrics, and 3) developing weak supervision techniques to reduce the cost of building the training data needed to develop subsequent NLP models. Work in the first area has progressed far enough that the DNTP is piloting the approach as part of efforts to convert legacy scientific data into usable formats.

I have been collaborating with Dr. Scott Auerbach at DNTP and Drs. Alex Tropsha and Vinicius Alves of the University of North Carolina at Chapel Hill (UNC) on a research project that uses chem-informatics-based analytics approaches to determine which organ systems are most likely to be impacted by exposure to a novel chemical. This project has entailed collecting and cleaning a diverse set of toxicology testing data to establish a data set for model development, followed by development of predictive models based on machine learning algorithms.
Based on the research, Dr. Alves has developed a web-based scientific application for visualizing model results. This work was done while Dr. Alves was a postdoc at NIEHS during 2020-2021. In summer 2021, Dr. Alves left the postdoc position to take a faculty position at UNC. The results of the work, including the web application, are currently being prepared for dissemination. In 2020 and 2021, some of the effort under this project was diverted to development and testing of chem-informatics QSAR models to support COVID-19/SARS research.

A grand challenge for environmental health research is gaining access to the vast array of existing scientific data and knowledge in an integrated manner that facilitates analysis and interpretation. I have been involved in ongoing collaborations with the NCATS-funded Data Translator program, which is focused on advancing the development of computational infrastructure and tools for integrating scientific knowledge. The Translator program has been developing knowledge bases from existing biomedical knowledge repositories, graph-based algorithms and methods for asking biomedical questions against the knowledge bases, and new methods for exposing environmental and clinical data through the Translator architecture. I was part of the Translator program before joining NIEHS, and I continue to advise and collaborate with portions of the program to promote the use of the technology for environmental health research.