Collaborative Research: OAC: Approximate Nearest Neighbor Similarity Search for Large Polygonal and Trajectory Datasets
University Of Texas At San Antonio, San Antonio TX
Investigators
Abstract
Similarity searches are a critical task in data mining. Nearest neighbor similarity search over geometrical shapes - polygons and trajectories - are used in various domains such as digital pathology, solar physics, and geospatial intelligence. In digital pathology for tumor diagnosis, tissues are represented as polygons and Jaccard distance - ratio of areas of intersection to union - is used for similarity comparisons. In solar physics for predicting solar flares, the query object and the dataset is made up of polygons representing solar events. In geospatial intelligence, similarity search is used to geo-locate a shape or a contour in global reference datasets. The current literature, while rich in methods for textual and image datasets, is lacking for geometric datasets. This project will develop scalable similarity search systems on polygonal and trajectory datasets. It will produce benchmark datasets of polygonal queries and responses for the research community and inform the data mining techniques which employ similarity primitives. It will help introduce student projects for courses on parallel, distributed, high performance, and data intensive computing, data mining, and spatial computing. This will also train PhD students, including those at a Hispanic Serving Institution. Given the ever increasing size of datasets, exact nearest neighbor searches requiring a scan of the entire dataset quickly become impractical, leading to approximate nearest neighbor searches. Traditional methods, such as using trees, suffer from the constraints of dimensionality. Approximate similarity search is required for scalability in processing large numbers of queries, index construction over big spatial data, and to address the dynamic nature of data itself. This project will explore approximate similarity search algorithms based on product quantization and locality sensitive hashing (LSH) techniques for 10-100 billion scale datasets. It will result in (i) new methods for creating robust signatures of geometric data, based on comprehensive exploration of the performance/accuracy tradeoffs among different encoding schemes, informed by spatial properties of the data and requirements of relevant distance metrics, (ii) scalable coarse quantization techniques to hierarchically organize the polygonal datasets into neighborhoods by preserving hyperspace locality properties, leading to product quantization based scalable systems, and (iii) LSH-based techniques focusing on designing LSH functions for Jaccard distance. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
View original record on NSF Award Search →