CAREER: Mitigating the Lack of Labeled Training Data in Machine Learning Based on Multi-level Optimization

$370,000 | FY2024 | CSE | NSF

University of California-San Diego, La Jolla, CA

Abstract

Machine learning has demonstrated great success in numerous applications such as autonomous driving, early disease detection, and drug design. The accuracy of machine learning models depends heavily on access to large-scale, human-labeled training data. However, such data is often very difficult to acquire in specialized domains such as healthcare, legislation, and environmental sciences, owing to the high cost of obtaining high-quality human labels and to data privacy concerns. This project will advance science by providing algorithms, software, and systems that automatically generate high-quality labeled data, mitigating the lack of labeled training data in specific domains and allowing the training of highly accurate machine learning models. The project will significantly broaden the applicability of machine learning across application areas by lowering data barriers, and will substantially reduce the labor costs of manual data annotation. For example, it will promote scientific discovery in structural biology and high-energy physics and streamline engineering design in wireless communication. It will facilitate the early detection of sepsis, lung cancer, Parkinson's disease, and sleep apnea, improving patient outcomes and quality of life. Applied to compound design and cement production, the developed technologies have the potential to expedite drug discovery and reduce energy consumption. To create high-quality labeled training data, this project will develop three complementary paradigms of novel approaches based on multi-level optimization and large language models, for: 1) end-to-end generation of labeled data; 2) annotation of unlabeled data; and 3) example-specific adaptation/selection of labeled source data, respectively.
First, the proposed data generation methods will leverage the worst-case and class-specific performance of downstream models to provide end-to-end, fine-grained guidance for generating data (with complex labels) that is tailored to improve the accuracy and robustness of downstream models and to promote balanced performance across classes. Second, the proposed data annotation methods will use an end-to-end mechanism that capitalizes on large language models, a sequence of verification procedures, and available side information to maximize the accuracy of generated labels. Third, the proposed adaptation/selection methods will distinguish between source examples that lie inside or outside a target domain and then determine an example-specific adaptation/selection action end-to-end to ensure optimal use of source data. In addition, the proposed optimization algorithms and distributed systems will tackle new challenges related to multi-level optimization, including non-differentiability, incompatibility with the optimizers of large language models, and scalability. This project is the first to systematically leverage multi-level optimization to create labeled data. It addresses a fundamental knowledge gap: existing methods often cannot execute multiple learning stages end-to-end and therefore fall short in tailoring generated data to improve downstream models’ performance. Another significant innovation of this project is its harnessing of large language models for data annotation, which will substantially reduce the costs of manual labeling. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
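
The abstract does not specify the project's algorithms. As a rough illustration of what "multi-level optimization" for example-specific selection of source data can look like, the sketch below solves a toy two-level problem. The lower level fits a scalar model on weighted source examples. The upper level adjusts the per-example weights to reduce loss on a small target-domain validation set. Everything here is an assumption made for illustration: the scalar ridge model, the closed-form lower-level solution, the toy data, and all function names (`fit_weighted`, `hypergradient`, `learn_weights`, `make_toy_data`) are hypothetical and are not taken from the award.

```python
import random


def fit_weighted(xs, ys, w, lam=1e-2):
    """Lower level: closed-form weighted ridge fit of y ~ theta * x.

    Returns theta and the regularized curvature a = sum_i w_i x_i^2 + lam.
    """
    a = sum(wi * x * x for wi, x in zip(w, xs)) + lam
    b = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    return b / a, a


def val_loss(theta, xv, yv):
    """Upper-level objective: mean squared error (halved) on target data."""
    return sum((theta * x - y) ** 2 for x, y in zip(xv, yv)) / (2 * len(xv))


def hypergradient(xs, ys, w, xv, yv, lam=1e-2):
    """Gradient of the validation loss with respect to each example weight.

    Chain rule: dL/dw_i = (dL/dtheta) * (dtheta/dw_i), where
    dtheta/dw_i = x_i * (y_i - theta * x_i) / a for the closed-form fit.
    """
    theta, a = fit_weighted(xs, ys, w, lam)
    g = sum(x * (theta * x - y) for x, y in zip(xv, yv)) / len(xv)
    return [g * x * (y - theta * x) / a for x, y in zip(xs, ys)]


def learn_weights(xs, ys, xv, yv, steps=200, lr=0.5, lam=1e-2):
    """Projected gradient descent on the weights, kept inside [0, 1]."""
    w = [0.5] * len(xs)
    for _ in range(steps):
        grad = hypergradient(xs, ys, w, xv, yv, lam)
        w = [min(1.0, max(0.0, wi - lr * gi)) for wi, gi in zip(w, grad)]
    return w


def make_toy_data(seed=0, n=40, m=30, theta_target=2.0, theta_other=-2.0,
                  noise=0.05):
    """Toy source set: first n examples in-domain, next n out-of-domain.

    Also returns m validation examples drawn from the target domain.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    ys = [(theta_target if i < n else theta_other) * x + rng.gauss(0.0, noise)
          for i, x in enumerate(xs)]
    xv = [rng.gauss(0.0, 1.0) for _ in range(m)]
    yv = [theta_target * x + rng.gauss(0.0, noise) for x in xv]
    return xs, ys, xv, yv


if __name__ == "__main__":
    xs, ys, xv, yv = make_toy_data()
    n = len(xs) // 2
    w = learn_weights(xs, ys, xv, yv)
    print("mean in-domain weight:", sum(w[:n]) / n)
    print("mean out-of-domain weight:", sum(w[n:]) / n)
```

In this toy setting the upper level tends to raise the weights of in-domain source examples and lower those of out-of-domain ones, which mirrors the inside/outside-of-target-domain distinction described above. A realistic system would not have a closed-form lower level and would need approximate hypergradients, which is where the non-differentiability and scalability challenges named in the abstract come in.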
