EAGER: Exploring Adapting Language Technology Across a Network of Domains

$100,000 | FY2013 | CSE | NSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Abstract

Much of the most successful software for processing and understanding natural language is based on learning from labeled examples. However, applications to diverse genres such as social media and historical documents have demonstrated the limitations of this approach, since the application data differs dramatically from the training examples. Labeling training datasets for each new genre is prohibitively expensive. Methods that adapt the software between the original source domain and the target (for example, from 20th-century newspapers to Shakespearean drama) are an attractive alternative and an active research area. However, language does not naturally fall into a few source and target domains; rather, documents exist in a multidimensional field of similarity and difference, based on metadata attributes such as the date of publication. In addition, binary source/target adaptation ignores vast amounts of unlabeled data that may bridge the gap between, say, the 20th and 17th centuries, or between text from the Wall Street Journal and text entered on Twitter. This EAGER award explores a new approach to adapting language technology to new application domains. Using explicit document metadata such as date of authorship (for historical documents) or product type (for online reviews), documents are situated in a network of fine-grained domains. Micro-adaptation is then performed between adjacent nodes in the network, which are expected to be more similar to each other than the distant source and target domains are. These micro-adaptations can then be propagated across the domain graph, yielding an adaptation path from source to target. Empirical evaluations will compare this approach to the current state-of-the-art practices: adapting directly from source to target, and adapting from the source to a broader set of non-source documents. In addition, a theoretical analysis will identify conditions under which this approach is likely to succeed.
Language technology already impacts society by facilitating the retrieval, organization, and summarization of information, but its inability to transcend a small set of training domains is one of the most critical obstacles to more widespread adoption. Key application domains such as social media, patient medical records, and legal documents differ substantially from available training corpora, and the development of effective technology for these areas depends on bridging the domain gap. In addition, the sociocultural variation found in online language dramatically reduces the performance of state-of-the-art systems, creating a "language gap" between standard and minority dialects. This research is not tied to any specific language processing task; rather, it promises to build a more robust foundation that can apply across many tasks, bringing the benefits of language technology to new users and settings.
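
As a rough illustration of the adaptation path described in the abstract, the sketch below situates documents in a chain of fine-grained domains keyed by century of authorship and finds the sequence of adjacent-domain steps from a source to a target. The abstract does not specify how the domain network is built or how each micro-adaptation is carried out, so everything here is an assumption: the function names (`build_domain_graph`, `adaptation_path`, `micro_adaptation_steps`) are hypothetical, linking only neighbouring centuries is one possible choice of adjacency, and the adaptation step itself is left out entirely.

```python
from collections import deque


def build_domain_graph(periods):
    """Link each fine-grained domain (here, a century) to its nearest
    neighbours in time. This adjacency rule is an assumption."""
    ordered = sorted(set(periods))
    graph = {p: set() for p in ordered}
    for earlier, later in zip(ordered, ordered[1:]):
        graph[earlier].add(later)
        graph[later].add(earlier)
    return graph


def adaptation_path(graph, source, target):
    """Breadth-first search for a shortest chain of adjacent domains
    from source to target. Returns None if no chain exists."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def micro_adaptation_steps(path):
    """Pair up adjacent domains along the path; each pair marks one
    place where a (hypothetical) micro-adaptation would be applied."""
    return list(zip(path, path[1:]))


# Hypothetical domains: documents grouped by century of authorship.
graph = build_domain_graph([1600, 1700, 1800, 1900, 2000])
path = adaptation_path(graph, 1900, 1600)
steps = micro_adaptation_steps(path)
print(path)   # [1900, 1800, 1700, 1600]
print(steps)  # [(1900, 1800), (1800, 1700), (1700, 1600)]
```

With a single metadata attribute the network reduces to a chain; with several attributes (for example, date and product type) the same search would run over a richer graph, and the choice among alternative paths becomes part of the research question.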
