Efficient Approaches to Summarize Sparse & Dynamic Datasets

$522,000 | FY2003 | CSE | NSF

University of California-Santa Barbara, Santa Barbara, CA

Investigators

Abstract

On-line analytical processing (OLAP) tools are increasingly used in diverse applications that range from business and earth science to digital libraries. Such applications need to deal with sparse and very large datasets. Furthermore, such data is updated in an append-only manner. In this proposal, novel summarization and aggregation techniques are being developed for high-dimensional datasets that are sparse and are updated in an append-only manner. These techniques are multi-resolution in nature and exploit the efficiency with which disks can read sequentially stored information. Iceberg cubes, which have proved to be particularly beneficial for sparse data cubes, are also being efficiently computed and materialized. Such sparse data cubes are represented using ranges, and approximations are derived by maintaining information about the top-k and bottom-k elements. Multidimensional data poses significant challenges not only for storage and retrieval but also for analysis, which becomes a fundamental problem in its own right. The focus of this research is on developing efficient representations for very high-dimensional data that are both sparse and dynamic. Efficient representations will enable fast analysis of high-dimensional data, especially in the context of spatial and temporal data, high-resolution images, and time sequences. The research is timely and is likely to have a profound impact on the development of efficient analysis tools for large high-dimensional datasets. The research results will contribute towards the design and development of the next generation of on-line analytical processing tools, which are sorely needed in both industrial and scientific communities. Currently, earth scientists often need to scale down earth-science computational models because of the complexity of spatial joins on large datasets. Similarly, analysts avoid building data cubes for high-dimensional datasets.
The tools and algorithms produced will be a step towards alleviating many of these problems. The PIs frequently interact with members of the local high-tech industry to provide guidance on problems related to the scalable management of high-dimensional data. The research results will directly contribute to such efforts. The research will also serve as a vehicle for the advanced training of graduate students, and the software developed will be used in both graduate and undergraduate education.
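
For readers unfamiliar with the term, the iceberg cube idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the investigators' algorithm: it is a naive in-memory version that assumes a COUNT aggregate and a simple minimum-count threshold, and the names `iceberg_cube`, `rows`, `dims`, and `min_count` are hypothetical. It enumerates every group-by over the dimensions and keeps only the cells whose count reaches the threshold, which is why the result stays small for sparse data.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_count):
    """Naive iceberg cube with a COUNT aggregate (illustrative only).

    rows      -- list of tuples, one value per dimension
    dims      -- tuple of dimension names, aligned with the row positions
    min_count -- cells with fewer matching rows are dropped

    Returns {group-by dimension names: {cell values: count}}.
    """
    cube = {}
    for k in range(len(dims) + 1):
        # Every subset of dimensions defines one group-by of the cube.
        for group in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in group) for row in rows)
            # The iceberg condition: keep only sufficiently populated cells.
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            if kept:
                cube[tuple(dims[i] for i in group)] = kept
    return cube


# Made-up sample data, purely for illustration.
rows = [
    ("CA", "2003", "rain"),
    ("CA", "2003", "sun"),
    ("CA", "2004", "sun"),
    ("NV", "2003", "sun"),
]
cube = iceberg_cube(rows, ("state", "year", "weather"), min_count=2)
```

In this toy input every fully specified cell occurs only once, so the three-dimensional group-by is absent from `cube` altogether, while coarser group-bys such as `("state",)` retain their well-populated cells. A practical implementation would prune the search rather than enumerate all group-bys, and would work against disk-resident data.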
