Towards a FAIR Digital Ecosystem in the Cloud
Harvard University, Cambridge MA
Investigators
Linked publications & trials
Abstract
Cloud Computing, Big Data Analytics, and Artificial Intelligence are transforming biomedical research. The NIH Data Commons will provide a common, cloud-agnostic, harmonized environment where these technologies can be deployed to serve NIH intra- and extra-mural researchers, implementing FAIR principles [Wilkinson2016]. Our proposal provides two essential capabilities: (1) Global Unique Identification (GUIDs) and (2) Digital Object Publication, Citation, Replication and Reuse. We propose a FAIR biomedical ecosystem where all primary and derived digital objects (e.g., datasets, software) receive GUIDs, assisting with Findability and Accessibility, Reusability, data provenance, reproducibility, and accountability of research outcomes. GUIDs will provide full Interoperability between DOIs and prefix: accession based Compact Identifiers through a common resolution services prefix registry. All digital objects will interoperate with multiple, hybrid clouds?open source and commercial?enabling researchers to select computing resources best matching their needs. 1 - The Global Unique Identifier (GUID) Capability provides a persistent, machine resolvable identifier platform for all FAIR objects in the Commons, fully aligned with community practices, recommendations, and metadata models. 2 - The Cloud Dataverse for Biomedical Digital Object Publication, Citation, Replication, and Reuse applies FAIR principles to primary and derived datasets, computational provenance, and software, making them fully FAIR compliant, while documenting the production processes. Cloud Dataverse integrates with multiple cloud computing solutions. As an example, with these capabilities, a researcher can extract a subset of data from TOPMed or GTEx and publish it in Cloud Dataverse, with its associated metadata, provenance, and terms of use. In publications, she can cite the data and software according to Data Citation Publishers guidelines [Cousijn2017] and Software Citation Principles [Smith2016]. Other researchers can access the dataset using the GUID in the citation, resolving the repository?s dataset landing page (as recommended by the Data Citation Principles [Martone2014, Fenner2017, Starr2015]). From this landing page, researchers can repeat the original calculation, perform new computations on the dataset, or use the provenance graph to learn how the dataset was created. The data and software published in the repository reflect the evolving nature of research; anyone can publish new versions with the provenance documenting the process. Our open-source software platforms used in production, adhere to FAIR principles, and provide the basis for the two capabilities: GUIDs and Cloud Dataverse enabling the use cases above. We will expand upon them, produce new tools, and reach a wider community. Currently GUIDs and provenance describe datasets; we will generalize these mechanisms to support other digital objects, focusing on software in the pilot phase. We will connect and harmonize DataCite, identifiers.org, and N2T/EZID services to provide GUIDs; and integrate the Dataverse repository software with the Massachusetts Open Cloud, built on top of the OpenStack cloud platform, and public commercial clouds (Microsoft Azure, Google Cloud) to provide Cloud Dataverse. Our past experience producing sustainable services for overlapping communities of developers and users demonstrates our ability to apply our expertise to supporting the larger and more diverse NIH Data Commons user community.
View original record on NIH RePORTER →