CAREER: Novel Summarization Techniques for Semi-Structured Data
University of California-Santa Cruz, Santa Cruz, CA
Abstract
The overall goal of this project is to explore novel summarization models for semi-structured data that enable approximate query answering in data-centric and document-centric databases. The project investigates the following key problems:

(a) Approximate query answering for data-centric XML databases. The project explores summarization models that capture both the structural and the value-based characteristics of a semi-structured data set. The developed models take into account the heterogeneity of the value content and can be dynamically maintained when the base data is modified.

(b) Document-centric XML summarization. The project explores a summarization model that enables the approximate computation of distance, under various metrics, between a summarized document and an input query. It investigates the application of this model to approximate query answering for document retrieval queries, as well as its integration within the query processing engine for the fast computation of candidate matches.

(c) Relational synopses for join queries. The project explores a novel class of relational synopses that uses summarization models for semi-structured data to approximate the statistical characteristics of a relational database.

The results of this research are applicable in areas that involve the exploration of large semi-structured data sets, e.g., scientific databases or digital libraries. To facilitate the evaluation of the proposed techniques in these target areas, system prototypes will be developed and made publicly available. The project will train graduate and undergraduate students in data reduction and approximation techniques, and the resulting publications, collected data sets, and developed software will be made available on the web for broader dissemination.

http://www.cs.ucsc.edu/~alkis/APX/