Collaborative Proposal: Using the Web as a Corpus for Empirical Linguistic Research

$165,000 | FY2001 | CSE | NSF

University of Maryland, College Park, College Park, MD

Investigators

Abstract

This project will develop tools that make it possible to retrieve naturally occurring sentences from the World Wide Web on the basis of lexical content and syntactic structure, providing linguists with an immediate, easily accessible source of raw linguistic data. The PIs will investigate specific linguistic hypotheses at the lexical semantics/syntax interface as an illustrative application of these tools. At a high level, the planned work constitutes an important step toward a new paradigm for linguistic research. Rather than relying entirely on introspective data generated by the linguist who is trying to (dis)prove a particular hypothesis, Web-enabled linguistics research will draw on the methodology and the tools developed by the PIs to supply naturally occurring data on which theories can rest. With regard to specific linguistic questions, the goal is to provide an explanation of the rules and constraints that govern three transitivity alternations (Middle, Unaccusative, Unspecified Object Deletion), and the PIs expect data made available by their tools to shed light on the "grey" area between competence and performance, that is, the linguistic behavior that seems to fall outside of rule-governed behavior. Although naturally occurring data are not accorded great emphasis in generative syntax, the use of text corpora has a tradition in the greater linguistic enterprise. An explosive new phenomenon in the world of naturally occurring text, the World Wide Web is an essentially untapped resource that embodies the rich and dynamic nature of language, presenting a data resource of unparalleled size and diversity.
