Improving Probabilistic Record Linkage and Subsequent Inference

$204,616 · FY2017 · SBE · NSF

University of Texas at Austin, Austin, TX

Investigators

Abstract

This research project will develop methods for linking records across databases in the absence of unique identifiers such as Social Security numbers and for making inference using the linked data files. Record linkage is a perennial and challenging problem across the social sciences, with important applications in areas such as demography, economics, public health, and official statistics. Plummeting costs of new forms of data collection and storage and the proliferation of "big data" have increased the need for merging such databases as researchers and statistical agencies struggle to integrate carefully curated datasets with messy and incomplete data from historical, administrative, and commercial sources. The methods developed in this project will facilitate the successful integration of different data sources, thus generating new resources for future research. These combined data sources may also provide some alternatives to expensive survey data collection in an era of declining response rates. Freely available software will be developed and stored in a public repository.

The increasing desire to deploy probabilistic record linkage has spurred significant research into various components of the process, such as how to compare records, how to reduce the number of record comparisons to keep the problem computationally feasible, how to quantify the weight of evidence for or against a link between records, and how to ultimately generate a merged database. Often these components are studied in isolation from each other and from the ultimate goal of making inferences using the merged files. This research project will take a more holistic view of the record linkage process in order to advance the state of the art. The project has two primary goals. The first goal is to develop new models for record linkage that incorporate the impact of preprocessing methods that reduce the total number of record pairs to be evaluated. While widely deployed and well motivated, these methods have effects on subsequent modeling that are not well understood. The second goal is to enhance understanding of uncertainty and error throughout the process and to develop imputation methods for propagating error due to uncertain record links and other missing data, such as item nonresponse in a survey. These methods will be designed with an eye toward large applications that require new computational approaches. The project is supported by the Methodology, Measurement, and Statistics Program and a consortium of federal statistical agencies as part of a joint activity to support research on survey and statistical methodology.
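
To make the components named in the abstract concrete, the following is a minimal sketch of two of them: reducing the number of record comparisons (commonly called blocking) and quantifying the weight of evidence for a link. It is not the project's method or software. It assumes a classical Fellegi-Sunter-style log-likelihood-ratio weight, and the toy records, the field names, and the agreement probabilities `M_PROB` and `U_PROB` are all hypothetical values chosen for illustration.

```python
import math
from collections import defaultdict

# Toy records: (id, surname, birth_year, zip). All values are made up.
file_a = [
    ("a1", "garcia", 1980, "78701"),
    ("a2", "nguyen", 1975, "78702"),
]
file_b = [
    ("b1", "garcia", 1980, "78701"),
    ("b2", "garcia", 1991, "78745"),
    ("b3", "nguyen", 1975, "78704"),
]

# Hypothetical agreement probabilities (assumptions, not estimates):
# M_PROB = P(field agrees | records refer to the same person)
# U_PROB = P(field agrees | records refer to different people)
M_PROB = {"birth_year": 0.95, "zip": 0.85}
U_PROB = {"birth_year": 0.02, "zip": 0.05}
FIELD_INDEX = {"birth_year": 2, "zip": 3}


def candidate_pairs(a, b):
    """Blocking: only pair records that share the same surname."""
    blocks = defaultdict(list)
    for rec in b:
        blocks[rec[1]].append(rec)
    for rec_a in a:
        for rec_b in blocks.get(rec_a[1], []):
            yield rec_a, rec_b


def match_weight(rec_a, rec_b):
    """Sum of per-field log2 likelihood ratios for agreement/disagreement."""
    weight = 0.0
    for field, idx in FIELD_INDEX.items():
        m, u = M_PROB[field], U_PROB[field]
        if rec_a[idx] == rec_b[idx]:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


pairs = list(candidate_pairs(file_a, file_b))
weights = {(ra[0], rb[0]): match_weight(ra, rb) for ra, rb in pairs}

for pair_ids, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair_ids, round(w, 2))
```

In this sketch blocking acts as a plain filter: pairs with different surnames are never scored, and the weights are computed as if that filtering had no effect on the model. That simplification is the kind of interaction between preprocessing and modeling that the first project goal sets out to study.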
