RI: Small: Learning Meaning and Grammar from Interaction, Context, and the World

$150,000 | FY2012 | CSE | NSF

Stanford University, Stanford CA

Investigators

Abstract

Natural language processing tasks like question answering or machine translation require sophisticated parsers: systems that extract grammatical dependency relations between words. However, traditional supervised methods of training parsers rely on very expensive hand-labeled datasets and generalize poorly to new words, grammar, languages, or genres of text. This project is pursuing three directions to significantly augment current unsupervised models of grammar induction. First is a new mathematical model of dependency parsing that draws on linguistic intuitions of constituency. Second is an architecture that jointly learns grammar and parts of speech, eliminating the need for supervised part-of-speech tags and hand-labeled datasets, and making grammar induction possible on a vast number of languages and genres. Third are ways to exploit new sources of data for unsupervised learning, including anchor text in web data, vastly expanding the scope of the problem beyond the small, clean, annotated treebanks commonly used in current work.

Language understanding by machine is a crucial tool for our nation: machine translation makes international web sites broadly accessible, sentiment analysis helps newspapers make politics more transparent, question answering systems help people disseminate knowledge, and information extraction helps corporations and people draw insights from vast databases of documents. By improving the fundamental parsing technology that underlies each of these tasks, and by making it possible to parse new languages and genres that could not be parsed before, this project has the potential to vastly increase both the power and the scope of these key applications.
