SGER: Annotated Corpus of Japanese Telephone Conversations

$18,902 | FY2000 | SBE | NSF

Stanford University, Stanford CA

Abstract

This project will create and distribute a linguistically annotated version of an existing corpus of Japanese telephone conversations, the CallHome Japan (CHJ) corpus. The CHJ corpus, released in 1996 by the Linguistic Data Consortium (LDC), consists of digitized speech data and text transcriptions of 120 spontaneous, unscripted telephone conversations in Japanese. While the transcriptions are of high quality, the usefulness of the CHJ corpus for language research is limited because it lacks syntactic, semantic, prosodic, and discourse annotations, including part-of-speech (POS) tags. This project will address this deficiency by creating a new version of the CHJ corpus that includes POS, semantic, and acoustic annotations. In particular, the following annotations will be provided for the entire corpus: (1) POS tags, using the LDC's inventory of 60 POS and morphological tags for Japanese; and (2) lexical semantic tags on verbs and nouns, using NTT's 400,000-word Goi-Taikei semantic lexicon and ontology. The annotated corpus will be made available to the general research community through the normal channels of the Linguistic Data Consortium. It is expected to stimulate and benefit research in corpus linguistics, natural language processing, Japanese linguistics, discourse and conversation analysis, variation (cross-linguistic, cross-cultural, and gender), and the prosody/syntax/semantics interfaces.
