Collaborative Research: CRI: CRD: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu

$733,029 | FY2008 | CSE | NSF

University of Colorado at Boulder, Boulder, CO

Abstract

Treebanks are corpora of naturally occurring text that have been annotated with morphological and syntactic (structural) information. In the last 15 years they have led to significant advances in natural language processing (NLP) by providing training data for supervised machine learning algorithms. These algorithms can now automatically perform useful part-of-speech tagging, parsing, and semantic interpretation. This project is creating a new-generation, multi-representational treebank. The languages being annotated are Hindi (400K words) and Urdu (200K words). The texts are being annotated in dependency structure (trees in which all nodes are labeled with words of the sentence), enriched with additional semantic role labels. The dependency representation is also being automatically mapped to a phrase-structure representation (in which the words are at the leaves of the tree and internal nodes are labeled with phrase markers). After standard quality control has been applied, both versions will be released to the public, providing an immediate boost to the performance of Hindi/Urdu NLP. A tool will also be released that allows a researcher to produce alternative formattings of the phrase-structure representation. This supports a view of the treebank as a more general, abstract representation of the morphology and syntax of the language rather than merely as data for a particular style of machine learning experiment. Research into parsing and other NLP tasks has recently recognized the benefits of reformatting syntactic representations in order to improve the machine learning process; this treebank will make that step much easier for all NLP researchers interested in Hindi or Urdu in particular and in language in general.

OISE is co-funding the University of Colorado student exchange with the IIIT in Hyderabad, India, where 400K words of Hindi and 200K words of Urdu will be annotated with dependency parses. This will enable an international research experience for U.S. students.
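The difference between the two representations described in the abstract can be illustrated with a minimal sketch. It is not the project's conversion procedure: the `DepNode` class, the `to_phrase_structure` function, the toy English sentence, and the tag names are all hypothetical, and the naive rule used here (each head with dependents projects one phrase labeled with its tag plus "P") assumes a projective tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DepNode:
    """A node in a dependency tree: every node is a word of the sentence."""
    word: str
    pos: str    # coarse part-of-speech tag (hypothetical tagset)
    index: int  # position of the word in the sentence
    children: List["DepNode"] = field(default_factory=list)


def to_phrase_structure(node: DepNode) -> str:
    """Naively map a dependency subtree to a bracketed phrase.

    Words end up at the leaves; each head that has dependents projects
    one internal node labeled "<pos>P". Assumes a projective tree.
    """
    leaf = f"({node.pos} {node.word})"
    if not node.children:
        return leaf
    # Place the head and its dependents in surface order.
    parts = [(node.index, leaf)]
    parts += [(child.index, to_phrase_structure(child)) for child in node.children]
    parts.sort(key=lambda item: item[0])
    return f"({node.pos}P " + " ".join(text for _, text in parts) + ")"


# Toy sentence (hypothetical example): "the dog chased cats"
the = DepNode("the", "D", 0)
dog = DepNode("dog", "N", 1, [the])
cats = DepNode("cats", "N", 3)
chased = DepNode("chased", "V", 2, [dog, cats])

bracketed = to_phrase_structure(chased)
print(bracketed)
```

In the dependency tree every node, including the root `chased`, is a word; in the bracketed output the words appear only at the leaves and the internal nodes carry phrase labels. A realistic conversion would also need to handle semantic role labels, empty categories, and non-projective structures, none of which this sketch attempts.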
