A software tool to facilitate variable-level equivalency and harmonization in research data: Leveraging the NIH Common Data Elements Repository to link concepts and measures in an open format

$275,541 · R43 · FY2023 · AG · NIH

Algenta Technologies, Minneapolis MN

Abstract

The National Institute on Aging (NIA) supports numerous studies and archives that collect and disseminate critical data about the aging population of the United States. By supporting the collection and dissemination of longitudinal and multidisciplinary data, the NIA provides researchers the opportunity to measure change and stability in individuals over time, as well as to investigate aging phenomena from an integrated theoretical perspective. In both cases, equivalent or related variables must first be linked or merged before producing appropriately documented data products for eventual harmonization and analysis. The current aging research data environment provides many opportunities for linking similar topical datasets and harmonizing extant common variables, but few software tools are available to facilitate this resource-intensive task. The proposed project will demonstrate the feasibility of a guided harmonization software prototype by concording variables from three nationally representative NIA-funded studies (MIDUS, NHATS, NSHAP) and mapping them against extant data element concept sources such as the NIH Common Data Elements library to identify equivalent concepts and variables. The software prototype will use machine learning and advanced text analysis algorithms to guide the creation of concorded databases (variable crosswalks) that support harmonization and discoverability, both within and across aging-related statistical datasets. Additionally, the prototype will use an open-standards metadata framework to produce richly described concordance databases that are interoperable, citable, and FAIR. Colectica has a track record of creating open-standards-based software tools that reduce data management burden by automatically extracting structured metadata from macro-level (study) and micro-level (variable) characteristics of aging studies.
Specifically, the prototype will evaluate the feasibility of human-in-the-loop algorithms to operate as a “recommendation engine” to guide the concordance of potentially equivalent or similar variables among multiple datasets. The core hypothesis posits that the prototype will significantly decrease the labor, time, and resources required to create accurate and standardized concorded databases. To test this hypothesis, the research team will: construct and evaluate recommendation algorithms for variable concordance (Aim 1); establish metrics for measuring the accuracy and effectiveness of concordance (Aim 2); and create a user interface to test the recommendation engine, its functions, and associated inputs and outputs (Aim 3).
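
To make the idea of a recommendation engine for variable concordance more concrete, the following is a minimal sketch only, not the project's method. The abstract does not specify its algorithms, so this example swaps in a simple token-overlap (Jaccard) similarity between variable labels, plus a precision-at-k score as one possible accuracy metric in the spirit of Aim 2. All variable names and labels (`hyp_var_a` and so on) are hypothetical and are not taken from MIDUS, NHATS, NSHAP, or the NIH Common Data Elements Repository.

```python
import re


def tokens(label: str) -> set[str]:
    """Lowercase a variable label and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(source_label: str, candidates: dict[str, str], top_k: int = 3):
    """Rank candidate variables by label similarity to the source label.

    Returns up to top_k (variable_name, score) pairs, best match first.
    A human reviewer would then accept or reject each suggestion.
    """
    src = tokens(source_label)
    scored = [(name, jaccard(src, tokens(label)))
              for name, label in candidates.items()]
    # Sort by descending score; break ties by name for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


def precision_at_k(recommended: list[str], accepted: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that a reviewer accepted."""
    top = recommended[:k]
    if not top:
        return 0.0
    return sum(1 for name in top if name in accepted) / len(top)


# Hypothetical candidate variables from a second dataset (assumed labels).
candidates = {
    "hyp_var_a": "Self rated physical health",
    "hyp_var_b": "Number of living children",
    "hyp_var_c": "Rated health compared to others same age",
}

ranked = recommend("Self rated health", candidates, top_k=3)
ranked_names = [name for name, _ in ranked]

# Suppose the reviewer accepted only the first suggestion (assumed decision).
score = precision_at_k(ranked_names, accepted={"hyp_var_a"}, k=2)
```

In this sketch the reviewer's accept or reject decisions are what make the loop human-in-the-loop: the ranking only proposes candidates, and the accepted set feeds the accuracy metric. A real system would likely replace the Jaccard score with a learned or embedding-based similarity.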
