Collaborative Research: CRI: An Open Linguistic Infrastructure for American English

$100,875 | FY2006 | CSE | NSF

Vassar College, Poughkeepsie NY

Investigators

Abstract

This project supports research in computational linguistics by planning to enhance the American National Corpus (ANC) with an open linguistic infrastructure. The infrastructure will add multiple manual and automatic annotations to a portion of the ANC and will provide free access to these annotations in a common XML data format via a project website. The following activities are envisioned:

- Incorporation of automatic annotations derived from freely available existing tools, mapped into the ANC XML format,
- Syntactic and named entity annotation of a 10-million-word gold standard corpus, with partial manual annotation,
- Hand-corrected automatic WordNet and FrameNet annotation for a portion of the gold standard corpus,
- Enhancement of automatic annotation performance via experimentation with machine learning techniques, and
- Development of a web interface for users to download the above annotations and to upload new annotations of the ANC.

This work, which describes methods for internal and external evaluation of the resources and tools developed, plans to create a diverse, richly and multiply annotated corpus of natural language, along with tools to access it. The full project would be the first large-scale effort of this kind, developing a 100-million-word ANC and providing a 10-million-word subset annotated with syntax, named entities, and semantic categories in WordNet (WN) and FrameNet (FN). The annotated data will be balanced across different genres of text.

One activity of the planning award consists of harmonizing all three resources (ANC, WN, and FN) and maximally exploiting their respective strengths. The other involves the continued development of the ANC, which, with the addition of a wide range of linguistic annotations, will serve the NLP community as a resource for language processing research and applications.
The planning project undertakes the following activities:

- Creation and annotation of WN senses and FN frames,
- Planning meetings,
- Further research and experimentation with methods and software to enhance automatic annotation, and
- Outreach to the US computational linguistics community.

Broader Impact: Full completion of this work will further enhance the ANC by creating a comprehensive linguistic infrastructure for American English. The availability of a massive, richly annotated corpus of American English has impacts at many levels and across several areas, including computational linguistics and natural language processing, corpus linguistics, cross-linguistic studies, dialect studies, language acquisition, and materials development for both English language students and teacher training.
