Adding Domain Knowledge to Inductive Learning Methods for Classifying Texts

$200,002 | FY2000 | CSE | NSF

University Of Pittsburgh, Pittsburgh PA

Investigators

Abstract

The objective of this research is to investigate the integration of background knowledge into a machine learning approach for automatically indexing text documents. Case-Based Reasoning models for utilizing past experiences have been developed for domains where the cases are text. The prohibitive cost of manually indexing cases has hindered the development and maintenance of large systems for applications in law, ethics, or help-desk settings. New methods that learn a text classifier from a small collection of annotated case summaries, and that can then classify large numbers of cases automatically, can help overcome this knowledge-acquisition bottleneck. Text learning algorithms used elsewhere are not applicable because they require large training sets. Here, background knowledge about the domain and a linguistic analysis of the examples are employed to develop a better representation of the examples, which will allow learning algorithms to generalize better from small collections of text cases. The project will also yield a better understanding of what makes a good text representation for learning and classification, and of the effects of adding background knowledge and natural language processing tools. The experiments are based on a relatively small collection in a well-defined domain, in which the PI and his group have accumulated significant expertise. This unique background allows a more thorough analysis of the experimental results than is generally performed. The classifier is evaluated on both a set of marked-up summaries and the corresponding full-length documents. Further experiments explore the use of unseen and unlabeled cases, and explain the observed behavior. The results and the analysis of the experiments will enable researchers in other domains to improve the representation of text cases. Thus, the research results will be relevant not only for case-based reasoning and machine learning, but also for information retrieval and other text-based applications.
