RI: Medium: Foundations of Self-Supervised Learning Through the Lens of Probabilistic Generative Models

$1,127,925 | FY2022 | CSE | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Pradeep K Ravikumar (contact), Andrej Risteski

Abstract

Supervised learning of modern machine learning models requires very large, high-quality labeled datasets. Labeling data requires human annotation, which is often too expensive for under-resourced end-users of machine learning. Unsupervised learning of machine learning models from unlabeled data has the promise to vastly increase the accessibility and inclusivity of modern machine learning. An emerging paradigm for such unsupervised learning is self-supervised learning (SSL), wherein a machine learning model is trained on tasks for which labels can be automatically generated. This approach is at the core of high-performing language and image machine learning models like BERT and DALL-E. However, despite its promise on many benchmarks across diverse domains, much of the current methodology for developing SSL methods is opaque and heuristic, and evaluation relies on ad-hoc choices of performance metrics. The goal of this project is to build scientific and mathematical foundations of SSL, and consequently also improve its practice. In some of the earliest work in this area, SSL was used to speed up tasks involving the learning of probabilistic models. Progressively, via a series of approximations for scalability, the outputs of SSL could no longer be rigorously tied to probabilistic model parameters, and the goal shifted to learning features that are "useful" for downstream tasks, that is, representation learning. "Useful", however, can often be mathematically difficult to pin down, so it is frequently not clear (even empirically, much less theoretically) what these methods learn about the data. At present, designing a well-performing SSL method entails trying many combinations of tasks and model architectures until a particular one gives good results on the downstream tasks.
This has two downsides: (i) it requires a substantial amount of trial-and-error; (ii) on a scientific level, it does not yield any understanding of what makes a particular task/architecture suitable, or of what the learned features capture about the data distribution. This project will repair the severed tie between probabilistic models and feature learning via self-supervised models by analyzing the aspects of a deep generative model that can be recovered via self-supervised learning. Moreover, through this lens, we propose to understand the relative advantages, both statistical and algorithmic, of self-supervised learning methods over other methods for learning probabilistic models.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
