COINSTAC: decentralized, scalable analysis of loosely coupled data

$680,020 · R01 · FY2016 · DA · NIH

The Mind Research Network, Albuquerque NM

Abstract

DESCRIPTION (provided by applicant): The brain imaging community is benefiting greatly from the extensive data sharing efforts currently underway [5,10]. However, existing strategies leave a significant gap, which we propose to address: they focus on anonymized, post-hoc sharing of either 1) full raw or preprocessed data (in the case of open studies) or 2) manually computed summary measures (such as hippocampal volume [11], in the case of closed or not-yet-shared studies). Current approaches to data sharing often impose significant logistical hurdles, both for the investigator sharing the data and for the individual requesting it (e.g., multiple data sharing agreements and approvals are often required from US and international institutions). This needs to change so that the scientific community becomes a venue where data can be collected, managed, widely shared, and analyzed, while also opening up access to the many data sets that are not currently available (see the recent overview from our group [2]). The large amount of existing data requires an approach that can analyze data in a distributed way while leaving control of the source data with the individual investigator; this motivates a dynamic, decentralized way of approaching large-scale analyses. We propose a peer-to-peer system called the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC). The system will provide an independent, open, no-strings-attached tool that performs analysis on datasets distributed across different locations. Thus, the step of actually aggregating data can be avoided, while the strength of large-scale analyses can be retained. To achieve this, in Aim 1, we propose uniform data interfaces that will make it easy to share and cooperate. Robust and novel quality assurance and replicability tools will also be incorporated.
Collaboration and data sharing will be done by forming temporary (need- and project-based) virtual clusters of studies that perform automatically generated local computation on their respective data and aggregate statistics in global inference procedures. The communal organization will provide a continuous stream of large-scale projects that can be formed and completed without the need to create new rigid organizations or project-oriented storage vaults. In Aim 2, we develop, evaluate, and incorporate privacy-preserving algorithms to ensure that the data used are not re-identifiable even with multiple re-uses. We will also develop advanced distributed and privacy-preserving approaches for several key multivariate families of algorithms (general linear model, matrix factorization [e.g. independent component analysis], classification) to estimate intrinsic networks and perform data fusion. Finally, in Aim 3, we will demonstrate the utility of this approach in a proof-of-concept study through distributed analyses of substance abuse datasets across national and international venues with multiple imaging modalities.
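The pattern of local computation followed by global aggregation can be illustrated with a minimal sketch. It is not COINSTAC code: the names `LocalStats`, `local_step`, and `global_step` are hypothetical, and the model is assumed to be a one-covariate linear regression, the simplest member of the general linear model family mentioned above. Each site shares only five summary numbers, and the aggregator recovers the same least-squares fit it would get from the pooled data. No privacy mechanism is included here; shared summaries alone do not guarantee that data are not re-identifiable, which is the concern Aim 2 targets.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class LocalStats:
    """Summary statistics a site shares instead of its raw data (hypothetical type)."""
    n: int
    sum_x: float
    sum_y: float
    sum_xx: float
    sum_xy: float


def local_step(x: Sequence[float], y: Sequence[float]) -> LocalStats:
    """Runs at each site; only these five numbers leave the site."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return LocalStats(
        n=len(x),
        sum_x=sum(x),
        sum_y=sum(y),
        sum_xx=sum(a * a for a in x),
        sum_xy=sum(a * b for a, b in zip(x, y)),
    )


def global_step(stats: List[LocalStats]) -> Tuple[float, float]:
    """Runs at the aggregator; returns (intercept, slope) of the pooled OLS fit."""
    n = sum(s.n for s in stats)
    sx = sum(s.sum_x for s in stats)
    sy = sum(s.sum_y for s in stats)
    sxx = sum(s.sum_xx for s in stats)
    sxy = sum(s.sum_xy for s in stats)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("covariate has no variance across sites")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two hypothetical sites whose raw data never leave their own machines.
sites = [
    ([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]),
    ([3.0, 4.0], [7.0, 9.0]),
]
intercept, slope = global_step([local_step(x, y) for x, y in sites])
```

Because the least-squares solution depends on the data only through these sums, the decentralized result matches the centralized one in this simple case. Iterative methods such as independent component analysis would need repeated rounds of exchange rather than a single pass.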
