CAREER: WebArchive -- Archiving the History of the Web
University of California-Los Angeles, Los Angeles CA
Investigators
Abstract
The goal of this project is to build the scientific foundation for archiving the history and evolution of the Web: tracking changes on the Web, compactly storing multiple versions of Web pages, and providing the stored pages to users through an easy-to-access interface. The project comprises both experimental prototype construction and careful study of the underlying technical challenges. The experimental effort focuses on building a system that tracks and archives a large number of Web pages over a long period of time. The fundamental technical challenges investigated by this project include: (1) a novel mechanism to download pages from Web search interfaces; (2) an efficient way of predicting and detecting changed pages on the Web; (3) a compact representation for storing multiple versions of Web pages; and (4) a novel index structure to support keyword queries on multiple versions of Web pages.

Success of this research will have broader impacts on the scientific community and the general public. It will provide a valuable testbed of Web history data that many humanities and engineering researchers can explore and analyze, enabling them to study the "digital trace" of human activity. The project Web site (http://webarchive.cs.ucla.edu/CAREER/) will be used for broad dissemination of results. In addition, both graduate and undergraduate students are involved in this project, gaining invaluable research experience. The new findings and the software infrastructure from the project are incorporated into new graduate-level courses offered by the PI. These courses will help students prepare for successful careers in computer science, a field that increasingly requires a solid understanding of large-scale data collection and management issues.
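As an illustration of challenges (2) and (3) only, the following minimal Python sketch detects a changed page by comparing content hashes and stores later versions as line-level deltas against the previous version. It is not the project's actual method, which the abstract does not describe. The names `fingerprint`, `make_delta`, `apply_delta`, and `VersionStore` are hypothetical, and the use of SHA-256 with `difflib` line matching is an assumption chosen for simplicity.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a page changed (assumed: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_delta(old: list, new: list) -> list:
    """Encode `new` as copy/add operations relative to `old` (line level)."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines are stored as a range reference, not as text.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only new or changed lines are stored verbatim.
            delta.append(("add", new[j1:j2]))
        # "delete" needs no entry: deleted lines are simply never copied.
    return delta


def apply_delta(old: list, delta: list) -> list:
    """Rebuild the newer version from the older one plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class VersionStore:
    """Hypothetical per-page store: first version in full, the rest as deltas."""

    def __init__(self):
        self.base = None          # lines of version 0
        self.deltas = []          # deltas[k] turns version k into version k+1
        self.latest = None        # lines of the newest version
        self.latest_hash = None   # fingerprint of the newest version

    def add(self, text: str) -> bool:
        """Store `text` if it differs from the newest version; report a change."""
        digest = fingerprint(text)
        if digest == self.latest_hash:
            return False          # unchanged page: nothing is stored
        lines = text.splitlines()
        if self.base is None:
            self.base = lines
        else:
            self.deltas.append(make_delta(self.latest, lines))
        self.latest, self.latest_hash = lines, digest
        return True

    def get(self, version: int) -> str:
        """Reconstruct a version (a trailing newline, if any, is not preserved)."""
        if self.base is None or not 0 <= version <= len(self.deltas):
            raise IndexError("no such version")
        lines = self.base
        for delta in self.deltas[:version]:
            lines = apply_delta(lines, delta)
        return "\n".join(lines)
```

In this sketch, reconstructing version k replays k deltas, so retrieval cost grows with the length of the version chain. A real archive would likely bound that cost, for example by storing periodic full snapshots, which is a design choice not covered by the abstract.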