Large-Scale Web Research Testbed

$439,828 | FY2003 | CSE | NSF

Stanford University, Stanford CA

Investigators

Hector Garcia-Molina (contact), Christopher D Manning, Dan Boneh, Jennifer Widom, Rajeev Motwani

Abstract

This project establishes a very large repository of current and historical Web content in support of two group research efforts: management of Web content, and analysis and mining of that content. The current facilities have been instrumental in examining many aspects of the World Wide Web (WWW), including experimentation toward understanding, optimally utilizing, and improving the Web. The facility has enabled researchers to try out various hypotheses and techniques for indexing and modeling WWW information. The system's highly configurable crawlers collect a large number of Web pages and store them locally, so that novel algorithms for tasks such as ranking, filtering, or Web linkage mapping can be tested on the collection. The current WebBase is underpowered: crawling speeds are limited by CPU performance (retrieved pages are compressed before being stored) and often by virtual memory space. Removing these two bottlenecks will make it possible to sustain a higher Web sample rate and to cover larger areas of the Web. An upgraded testbed, built by scaling up the size and processing speed of the existing WebBase hardware, will be used to study and evaluate different Web crawling, archive refreshing, data compression, and storage and indexing techniques. Moreover, the project investigates problems related to data extraction, semantic search, searching for non-text objects, access control, cross-temporal analysis, and mining patterns or relationships between entities. Problems to be addressed include how to: collect an ever-growing amount of Web data and keep it up to date; provide improved search capabilities over such data, better exploiting the semantics of data and user requests; efficiently process high-volume real-time data streams; organize a Web archive that captures the "history" of the Web; and deal with new types of sources (e.g., the hidden Web or chat rooms) and new types of data (e.g., images).
In addition, the new WebBase facility will support teaching at various universities by providing a testbed where students can develop new searching, indexing, and user presentation ideas. WebBase draws together faculty in the areas of data mining, security, natural language processing, and database systems, so that these areas enhance each other. Thus, the infrastructure will support: experimental research in a critical area, the management and exploration of Web information; researchers at institutions that do not have sufficient facilities for large-scale Web crawling; and teaching of courses on information retrieval and data mining.
