CAREER: Long-Tailed Learning in the Open and Dynamic World: Theories, Algorithms, and Applications

$444,000 | FY2024 | CSE | NSF

Virginia Polytechnic Institute And State University, Blacksburg VA

Investigators

Abstract

A common and fundamental property of real-world data is the long-tailed distribution: the majority of examples come from a few key categories (majority classes), while the rest belong to a massive number of tail categories (minority classes). This pattern appears across a wide range of domains, including financial fraud detection, e-commerce recommendation, scientific discovery, and rare disease diagnosis. Although there has been considerable research on long-tailed learning, the vast majority has been conducted in an artificial, closed environment with predefined domains, data distributions, and downstream tasks. A natural and fundamental research question remains largely unexplored: How can we take this research one step further to enable Open-World Long-Tailed Learning (OpenLT), where the domains are heterogeneous, open-ended, and evolving over time? Building upon the existing observatory work, this project aims to develop fundamental theories and algorithms for OpenLT through three research thrusts. The first thrust aims to develop fundamental theories for a better understanding of the OpenLT problem. The second thrust aims to create a generic computational framework for heterogeneous long-tailed data in the wild. The third thrust systematically validates and verifies the theories and techniques from the first two thrusts on high-impact applications, including financial fraud detection and rare disease diagnosis.

Upon completion, this project will advance the state of the art in long-tailed learning in two key dimensions. First, it will establish theoretical foundations for OpenLT, encompassing the unification of long-tailedness measurements, reliability analysis, and generalization bound analysis, most of which are currently absent from the existing literature. Second, it will lead to a generic OpenLT computational framework with novel pre-training, fine-tuning, and adaptation techniques, which is anticipated to yield substantial improvements in open and dynamic environments.

The research outcomes will be integrated into a variety of educational activities during and beyond the course of this project. Leveraging various supporting programs at Virginia Tech, the investigator will ensure that students at different levels (e.g., K-12, undergraduate, and graduate students) have the opportunity to learn from and participate in the advancements brought forth by this research. The research findings will be integrated into the machine learning and data science courses taught by the investigator and disseminated through various channels, including paper publications, conference tutorials, workshops, and potential technology transfers.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
