Manifold Coordinates with Physical Meaning
University of Washington, Seattle, WA
Abstract
Finding meaningful low-dimensional descriptors for high-dimensional phenomena has long been one of the motors of scientific discovery. For instance, the "genetic maps" introduced by Cavalli-Sforza map the variation of human genomes onto 2-dimensional geographic locations, charting prehistoric migrations. In that example, the scientists' intuition guided the mapping of the genomes onto the spatial dimensions. This project will develop a general statistical framework to expand and automate this process. A scientist provides a list of descriptors with scientific meaning that could be used to unfold the data in low dimensions. This list is called a "dictionary". The dictionary mediates between expert knowledge, expressed in the concepts of the scientific domain, and the lower-level representations of the data used by learning algorithms. This project will facilitate the transfer of knowledge between scientist and machine. A statistical learning algorithm will replace the task of manually checking individual descriptors for correlation with the data variation; the algorithm will perform this task on the whole dictionary at once. The output is a small set of descriptors from the dictionary, which together capture most of the variation in the data; we call them Interpretable Embedding Coordinates (IEC). Unlike Principal Components or Principal Directions, which are abstract, these coordinates are always meaningful and interpretable, because they are selected from the dictionary of descriptors supplied by the scientist.

In this project it is assumed that the data lie on or near a smooth low-dimensional manifold, and that the dictionary consists of smooth functions on the manifold. The new method finds coordinates on the manifold among the interpretable, meaningful functions in the dictionary. The search for Interpretable Embedding Coordinates (IEC) will be formulated as a non-parametric, non-linear sparse functional regression problem.
The main idea is to transform this problem into a linear sparse regression in the space of function gradients. This allows one to apply the well-developed arsenal of sparse recovery methods to IEC without sacrificing the original non-linearity of the problem. Statistical and geometric guarantees for recovery will be given. The new methods will be integrated into megaman, the big-data unsupervised learning platform distributed and maintained by Meila's group. PI Meila, with support from the UW eScience Institute, will disseminate the ideas and methods in an online Active Training Lab on Unsupervised Learning. This project is part of Meila's current research program, "Unsupervised Validation for Unsupervised Learning", which aims to design mathematically grounded methods to interpret, verify, and validate the output of machine learning algorithms for scientific data and scientific discovery.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
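The gradient-space reformulation described in the abstract can be illustrated with a small toy sketch. It is not the project's algorithm or the megaman API. Everything in it is an assumption made for illustration: the data are sampled from a flat 4-dimensional space (so no tangent-space estimation is needed), the dictionary and the target function are hypothetical, and a group lasso solved by proximal gradient stands in for whichever sparse recovery method the project actually uses. The idea shown is the chain rule: if a target f is a non-linear function of a few dictionary functions, then at every data point the gradient of f is a linear combination of the gradients of only those functions, so the same few dictionary columns are active at all points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 4
# Toy data: a flat 4-D space stands in for the manifold (assumption).
X = rng.uniform(-0.5, 0.5, size=(n, D))

def dictionary(x):
    # Hypothetical dictionary of p = 4 smooth functions.
    return np.array([x[0] + 0.5 * x[1] ** 2,
                     x[1] + 0.5 * x[2] ** 2,
                     x[2] + 0.5 * x[3] ** 2,
                     x[3]])

def dictionary_grads(x):
    # Column j is the analytic gradient of dictionary function j at x (D x p).
    G = np.eye(4)
    G[1, 0], G[2, 1], G[3, 2] = x[1], x[2], x[3]
    return G

def target(x):
    # Hypothetical target: a non-linear function of dictionary functions 0 and 1 only.
    g = dictionary(x)
    return np.sin(g[0]) + g[0] * g[1]

def numeric_grad(fun, x, h=1e-5):
    # Central finite differences, standing in for a gradient estimated from data.
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        grad[k] = (fun(x + e) - fun(x - e)) / (2 * h)
    return grad

G = np.stack([dictionary_grads(x) for x in X])      # shape (n, D, p)
Y = np.stack([numeric_grad(target, x) for x in X])  # shape (n, D)

def group_lasso(G, Y, lam, n_iter=500):
    # Minimize 0.5 * sum_i ||Y_i - G_i b_i||^2 + lam * sum_j ||B[:, j]||_2
    # by proximal gradient; group j collects the coefficients of dictionary
    # function j across all data points.
    n, D, p = G.shape
    B = np.zeros((n, p))
    step = 1.0 / max(np.linalg.norm(Gi, 2) ** 2 for Gi in G)
    for _ in range(n_iter):
        resid = np.einsum('idp,ip->id', G, B) - Y
        B = B - step * np.einsum('idp,id->ip', G, resid)
        norms = np.linalg.norm(B, axis=0)
        # Group soft-thresholding: shrink each column toward zero as a block.
        B = B * np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
    return B

B = group_lasso(G, Y, lam=0.01)
group_norms = np.linalg.norm(B, axis=0)
selected = sorted(np.argsort(group_norms)[-2:].tolist())
print("group norms:", np.round(group_norms, 3))
print("selected dictionary indices:", selected)
```

In this toy setting the two columns with the largest group norms correspond to the two dictionary functions that the target was built from, even though the target depends on them non-linearly. On real data the gradients would have to be estimated and projected onto estimated tangent spaces, which this sketch omits.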