CAREER: Scalable Record Linkage through the Microclustering Property

$449,985 | FY2017 | SBE | NSF

Duke University, Durham NC

Abstract

Duplicative information across multiple databases is a common problem, whether one is trying to accurately estimate the number of patients who have died from sepsis in the United States, the number of people who live in a congressional district, or the number of individuals who have died in armed conflicts. Before such questions can be answered accurately, duplicated information from databases must be removed in a systematic and accurate way. In the research literature, this process is commonly known as record linkage, de-duplication, or entity resolution. This CAREER award will develop general methods and scalable algorithms for record linkage so that pressing global issues can be addressed in real time or near real time. The modeling and computational tools to be developed will significantly increase the volume of data that can be analyzed. This project will enable researchers to address a broader range of scientific questions and advance research in multiple domains, including precision medicine, official statistics, and human rights. To facilitate these advances and encourage further development, all algorithms will be released as open source software. In terms of education, the investigator will expand the Youth in Machine Learning (YiML) program to enable 50 high school students and 50 undergraduate students per year to participate in the bootcamp and skills-building workshops offered. This will enhance the pipeline of students prepared to study machine learning in future years. At an international level, the investigator will teach workshops at the International Society for Bayesian Analysis Meeting, including a YiML workshop for women.

This research project will develop flexible, general Bayesian nonparametric models for record linkage tasks that propagate the amount of linkage error exactly. The project also will develop scalable record linkage algorithms. By drawing on recent advances in clustering, Bayesian nonparametrics, and probabilistic dimension-reduction algorithms, this project will advance the state of the art in record linkage. The models and algorithms to be developed will attempt to solve the microclustering problem, which is at the core of this research. In collaboration with domain experts, the investigator will test the new methods using data sets from health care, official statistics, and human rights. The resulting estimates may provide useful information for policy makers in these areas.
