GGrantIndex
← Search

The Complexity of Testing Distributions

$200,000FY2005CSENSF

Massachusetts Institute Of Technology, Cambridge MA

Investigators

Abstract

In a wide variety of computational settings, where the data is most naturally viewed as coming from a distribution, it is often crucial to determine whether the underlying distribution satisfies various properties. Examples of such properties include whether two distributions are close or far in statistical distance, whether a joint distribution is independent, and whether a distribution has high entropy. For most such properties, standard statistical techniques which approximate the distribution lead to algorithms which use a number of samples that is nearly linear in the domain size. Until very recently, distributions over large domains, for which linear sample complexity can be daunting, have received surprisingly little attention. However, new interest in these questions comes from many directions, including data mining, Physics and machine learning applications in Neurobiology. Recent results have shown that one can achieve results which are significantly more efficient than the standard techniques for the case of large domains. The intellectual merit of this research will lead to an understanding of the sample, time and space complexity required to identify various natural properties of a probability distribution. The proposed research will focus on determining which properties can be understood with a number of samples that is sublinear in the domain size, and will lead to an understanding of the aspects of algorithm design that are specific to these constraints. The questions that will be considered range from considering the complexity of testing previously unstudied properties, finding general techniques which apply to classes of distribution testing problems, investigating new models of distribution testing, understanding structural aspects of the testing problems that can be solved in sublinear time, and further understanding the relationship between the computational complexity and sample complexity. The broader impact of this proposal includes educational and workforce development components. The educational component of this proposal involves designing course material on algorithms for testing distributions that would be appropriate for students at all levels. Enough material has been collected to develop a graduate course that highlights the body of techniques that have emerged in this field. Some of the recent advances are a perfect fit for conveying fundamental ideas behind randomized algorithms to undergraduates. The PI has been awarded two teaching awards for her efforts at undergraduate education. The PI will continue her efforts as an advisor and mentor to undergraduates, graduate students and postdoctoral researchers, which in the past have included several women who have gone on to successful research careers. The PI is currently co-organizing a Dagstuhl workshop on Sublinear Algorithms. A priority has been placed on inviting and supporting graduate students interested in working in the area. Finally, the ideas will also be disseminated through the use of the web, seminar, workshop and conference presentations, journal articles and survey articles aimed at a wider audience.

View original record on NSF Award Search →