EAGER: DCL: SaTC: Enabling Interdisciplinary Collaboration: Efficient Human-in-the-Loop Redaction of Language Development Corpora

$299,992 | FY2022 | CSE | NSF

University Of Chicago, Chicago IL

Investigators

Blase Ur (contact), Chenhao Tan, Marisa Casillas, Susan J Goldin-Meadow

Abstract

At great effort and expense, and with the cooperation of hundreds of parents, teachers, and children, researchers have collected conversation transcripts to study topics like children's language development. The data most useful for science are longitudinal and naturalistic, such as data collected periodically over time in children's homes. Unfortunately, the longitudinal, naturalistic corpora most likely to advance knowledge may contain information that renders participants identifiable. For this reason, naturalistic corpora are rarely shared with other researchers, hindering science. Sharing requires careful redaction--the removal of potentially identifying information. Currently, naturalistic corpora are often too large for manual redaction, and current automated tools both miss critical redactions and over-redact important information. To enable such data to be shared, this project seeks to develop novel computational methods for redaction. This project's aim is to develop initially automated, human-in-the-loop redaction of identifying information in unstructured text data. First, to better understand key challenges around what aspects of transcripts make participants identifiable, the researchers are conducting interviews with social and behavioral science researchers and members of ethics boards. From these insights, the researchers are developing novel models for predicting what language may need to be redacted and they are designing novel user interactions for leveraging human expertise in redaction decisions. The unique characteristics of conversation transcripts require modeling novel features of language, drawing from natural language processing, psychology, privacy engineering, and linguistics. 
Because automated methods lack human insights into conversational context for making complex redaction decisions, the researchers are designing user interfaces that summarize how marked language, or tokens, appear longitudinally in transcripts, enabling human coders to quickly make redaction decisions. As a case study, the researchers are applying these techniques to the Language Development Project, a longitudinal corpus of 100 diverse children's development of language. The project is also training students in multidisciplinary research across the computational and social sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
