CRI: CRD A Richly Annotated Resource for Language Processing and Linguistics Research

$1,173,089 | FY2007 | CSE | NSF

Vassar College, Poughkeepsie NY

Investigators

Abstract

This project is annotating a corpus of American English for a variety of linguistic features, including syntactic structures and semantic information. The semantic information includes frame information based on FrameNet together with sense information based on WordNet. The annotations in the corpus are manually assigned by human annotators to ensure their reliability. Bootstrapping methods, using portions of the hand-validated annotations, are being used to improve the performance of automatic annotation tools. The corpus is drawn from the materials in the American National Corpus, which consists of written data and speech transcriptions generated by native speakers of American English and representing a broad range of genres. All of the annotations are represented in a common format to enable merging different annotation layers, so that interactions among different linguistic phenomena can be studied. The manually annotated corpus will provide an unparalleled resource for computational linguists and linguists who seek to identify patterns of syntactic and semantic usage that can feed the development of language models. This information can be used to train software to automatically annotate unseen data, which in turn enhances applications such as information retrieval, information extraction, and machine translation. Usage patterns for American English are also invaluable for the development of materials and tools to support English language learning. The resulting corpus and its annotations, together with tools for manipulating the data, will be made freely available for research purposes through the Linguistic Data Consortium.
