III: Small: Collaborative Research: Algorithms, systems, and theories for exploiting data dependencies in crowdsourcing

$250,000 | FY2020 | CSE | NSF

Lehigh University, Bethlehem PA

Investigators

Abstract

Data that encode knowledge are abundantly available in many domains, such as biomedical research, online commerce, open government, education, and public health. Machine learning is a powerful tool for discovering novel knowledge from data and for helping individuals and organizations make informed decisions. However, machine learning needs to be bootstrapped with human-annotated knowledge, which can be expensive to obtain and may contain human errors. The research team discovers and exploits dependencies in the data, using novel methodologies to significantly reduce the cost and noise of providing critical knowledge for machine learning. The research outputs, including algorithms, systems, and theories, are sufficiently generic to benefit many domains where machine learning is applicable. Through this fundamental research, the team will train undergraduate and graduate students for the nation's STEM workforce.

The researchers will collaborate to develop algorithms, systems, and theories for reducing cost and noise when annotating dependent data, termed “structured annotations”, to provide supervision knowledge for machine learning. While dependencies can make data annotation costly and error-prone, the researchers view them as a useful inductive bias for selective and accurate annotation. In particular, the research team proposes a human-in-the-loop system to aid the construction of suitable probabilistic graphical models that encode the dependencies. The project team combines contextual and multi-armed bandits with scalable graph inference algorithms to reduce labeling costs. Building on these graphical bandits, the team addresses budget allocation when the label of the same data point is queried repeatedly for robustness. To handle noisy human annotations, the team formulates optimization problems and algorithms that jointly infer the annotator competences and the ground-truth labels of the data.
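To make the idea of jointly inferring annotator competences and ground-truth labels concrete, the following is a minimal sketch, not the project's method. It assumes binary labels, independent data points (the dependencies central to the project are ignored), and a single accuracy parameter per annotator, fitted with a standard expectation-maximization loop in the style of a one-coin Dawid-Skene model. The function name, its parameters, and the toy annotations are hypothetical.

```python
import math
from collections import defaultdict


def infer_labels_and_accuracies(annotations, n_iters=50, eps=1e-6):
    """Jointly estimate P(true label = 1) per item and accuracy per annotator.

    annotations: list of (item, annotator, label) tuples, label in {0, 1}.
    Assumes a one-coin noise model: each annotator reports the true label
    with a fixed probability (their accuracy), independently per item.
    """
    by_item = defaultdict(list)
    for item, annotator, label in annotations:
        by_item[item].append((annotator, label))

    # Initialise each item's posterior with its fraction of "1" votes.
    posterior = {
        item: sum(label for _, label in votes) / len(votes)
        for item, votes in by_item.items()
    }
    accuracy = {}

    for _ in range(n_iters):
        # M-step: accuracy = expected fraction of an annotator's labels
        # that agree with the (soft) current estimate of the truth.
        agree = defaultdict(float)
        count = defaultdict(int)
        for item, votes in by_item.items():
            p1 = posterior[item]
            for annotator, label in votes:
                agree[annotator] += p1 if label == 1 else 1.0 - p1
                count[annotator] += 1
        accuracy = {
            a: min(max(agree[a] / count[a], eps), 1.0 - eps) for a in count
        }

        # E-step: update each item's posterior from accuracy-weighted votes,
        # using a uniform prior over the two labels.
        for item, votes in by_item.items():
            log_odds = 0.0
            for annotator, label in votes:
                weight = math.log(accuracy[annotator] / (1.0 - accuracy[annotator]))
                log_odds += weight if label == 1 else -weight
            log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow in exp
            posterior[item] = 1.0 / (1.0 + math.exp(-log_odds))

    return posterior, accuracy


if __name__ == "__main__":
    # Hypothetical toy data: ann_a and ann_b agree; ann_c always disagrees.
    truth = {"x1": 1, "x2": 0, "x3": 1, "x4": 0, "x5": 1}
    toy = []
    for item, y in truth.items():
        toy += [(item, "ann_a", y), (item, "ann_b", y), (item, "ann_c", 1 - y)]
    post, acc = infer_labels_and_accuracies(toy)
    print(post)
    print(acc)
```

Initializing from the vote fractions anchors the estimate to the majority opinion, so an annotator who consistently disagrees with the majority is assigned a low accuracy and their votes are effectively inverted. A model that exploits dependencies between data points, as the abstract describes, would replace the independent per-item update with inference over a graphical model.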
From a theoretical perspective, the project will advance active learning in crowdsourcing settings under more realistic noise distributions and will analyze regret in structured annotation. The project will result in datasets, algorithms, and a testbed system that benefit not only the core machine learning research community but also the many domains that use machine learning. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
