CI-P: Planning for Scalable Language Resource Creation through Novel Incentives and Crowdsourcing

$99,754 | FY2016 | CSE | NSF

University of Pennsylvania, Philadelphia, PA

Investigators

Christopher M Cieri (contact), Chris Callison-Burch, Mark Liberman

Abstract

Advances in human language technologies enable systems that, for example, obey natural language commands and respond in kind, translate among many language pairs and summarize multilingual news. However, the technology's potential remains largely untapped because the linguistic resources that fuel development still fall far short of need. This community infrastructure planning (CI-P) initiative begins the process of building infrastructure to continuously develop high-quality language resources by employing techniques proven to work in multiple scientific disciplines. Social media, crowdsourcing, games with a purpose and citizen science show us that human resources are effectively limitless for some activities. By offering human contributors appropriate opportunities and incentives, this project enhances language resource development well beyond what direct funding alone can produce. By removing constraints on participation and designing activities to appeal to multiple communities, the project creates educational opportunities for the public, including students and under-represented groups. The increase in scale and diversity of data also benefits those working in language-related research, education and technology development. The availability of an ever-growing body of resources for an expanding range of languages will permit developers to supply technologies to a greater proportion of the world. This project is the first step in the creation of infrastructure capable of high-volume, continuous collection of language data and judgments through ubiquity, perseverance, comprehensive annotation, automated training and certification, appropriate incentives, task engineering and variants of crowdsourcing. Building upon the Linguistic Data Consortium's WebAnn framework, virtual front-end web servers provide multiple interfaces to incentivize and engineer linguistic data contributions from targeted groups: linguists, citizen scientists, game players and students.
Collection and annotation activities are analyzed into component tasks according to the skills they require and are assigned as appropriate to different workforces using different workflows. The combination of customized interfaces and novel incentive strategies enables ongoing, scalable data collection and annotation resulting in diverse language resources available to the wider Computer and Information Science and Engineering research and education communities.
