AF: Small: Learning and Testing Classes of Distributions

$471,875FY2013CSENSF

Columbia University, New York NY

Investigators

Abstract

A long and successful line of research in machine learning deals with algorithms that learn from "labeled" data, where a target function is assumed to provide a label for each data point. A major focus of theoretical work has been to develop efficient algorithms for learning different classes of target functions. Recent years have witnessed a data explosion across many domains of science and society, but much of this newly available data consists simply of example points (DNA sequences, sensor readings, smartphone user locations, etc) without any labels. A natural model of such scenarios is that data points are generated according to some unknown probability distribution (typically over an extremely large domain). The goal of the proposed work is to study the learnability of different classes of probability distributions given access to samples drawn from the distributions. This is closely analogous to the framework of learning from labeled data sketched above, but with probability distributions playing the role of functions as the objects to be learned. In this project, the PI will perform theoretical research on developing computationally efficient algorithms for learning and testing various natural types of probability distributions over extremely large domains. (Testing algorithms are algorithms which, instead of trying to accurately model an unknown distribution, have the more modest goal of testing whether or not the distribution has some property of interest.) Specific problems the PI will address include: (1) Developing efficient algorithms to learn and test univariate probability distributions that satisfy various natural kinds of "shape constraints" on the underlying probability density function. Preliminary results suggest that dramatic improvements in efficiency may be possible for algorithms that are designed to exploit this type of structure. (2) Developing efficient algorithms for learning and testing complex distributions that result from the aggregation of many independent simple sources of randomness. The algorithms that the PI will work to develop can provide useful modelling tools in data-rich environments and may serve as a "computational substrate" on which large-scale machine learning applications can be developed for real-world problems spanning a broad range of application areas. Other important focuses of the grant are to train graduate students through research collaboration, disseminate research results through seminar talks, survey articles and other publications, and to continue ongoing outreach activities aimed at increasing interest in theoretical computer science topics in elementary school students.

View original record on NSF Award Search →