AF:Small: Nearest Neighbor Search in High Dimensional Spaces

$449,960FY2016CSENSF

Columbia University, New York NY

Investigators

Abstract

The goal of this project is to advance the state of the art of algorithms for the nearest neighbor search (NNS) problem. The NNS problem is one of the central computational problems arising when dealing with modern massive datasets. For example, it underlies a classical classification rule in machine learning: to label a new object (such as an image), one can simply find the most similar objects (the nearest neighbors) in a preprocessed database, and use the label of the objects found. More generally, NNS is a key algorithmic tool in many areas including databases, data mining, information retrieval, computer vision, computational geometry, signal processing, bioinformatics, and others. In such applications, the objects are usually represented in a high-dimensional space: e.g., a 20x20 image is naturally represented by a 400-dimensional vector, with one coordinate per pixel. The PI intends to study the high-dimensional NNS problem, by addressing both foundational algorithmic questions, as well as applied aspects. The project aims for both scientific and educational impact. First, the study of the NNS problem is instrumental in developing fundamental concepts in areas such as high-dimensional computational geometry as well as sublinear space algorithms (including concepts such as dimension reduction, sketching, metric embeddings, etc). Second, due to the numerous applications of NNS, its efficient implementations are used widely in industry. Overall, the PI aims to foster a stronger connection between the theory and practice of NNS by code dissemination, public lectures, and student training. The PI's affiliation with Columbia's Data Science Institute puts the PI in a particularly good position to accomplish these goals. To accomplish the project's algorithmic goals, the PI will leverage the recently developed approach of data-dependent hashing, where the hash function itself adapts to a given dataset. As a proof-of-concept, the PI and co-authors recently demonstrated that this new approach to NNS leads to algorithms outperforming the classical NNS algorithms (such as those based on the Locality-Sensitive Hashing). The project aims to develop this methodology further to its maturity, extend it to other relevant metrics (similarity measures) which have traditionally resisted efficient solutions, develop practical versions of the algorithms, and to understand the limits of these techniques.

View original record on NSF Award Search →