On Partially Supervised Text Classification

$230,187 · FY2003 · CSE · NSF

University of Illinois at Chicago, Chicago, IL

Abstract

Text classification is the automated assignment of text documents to pre-defined categories (classes). Traditionally, a classifier is built from training documents that have been labeled, often manually, with those classes, and is then used to classify new documents. This classic model is called supervised classification because every training document carries a class label. Although the model is important, it does not apply in some common practical situations. For example, given a set of training documents from a particular class P (the positive class), one may want to classify a set M of mixed, unlabeled documents, which contains documents from P along with other types of documents (negative documents), into documents that belong to P and documents that do not. Because no labeled negative documents are available, traditional classification techniques are inapplicable. This problem is called partially supervised classification (PSC). Existing methods for solving PSC are based on heuristics, which are prone to error. Researchers typically evaluate PSC techniques with the evaluation method used for supervised classification, which is inadequate because that method assumes labeled negative documents are available. The objectives of this project are to design a robust and principled technique for solving PSC, implement a PSC system, devise a method for evaluating such techniques, and identify methods for determining the minimum number of labeled documents needed to achieve optimal accuracy, so as to reduce manual labeling effort. The algorithms and systems developed in this project will be used in a graduate course and made available on the Web for other researchers to use in their research and teaching. The results of this research should be widely useful because identifying targeted information and documents is of great value in the information age.
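
The PSC setting described in the abstract can be made concrete with a small sketch. It is not the technique this project proposes. It assumes a naive baseline of the heuristic kind the abstract calls error-prone: every document in M is provisionally treated as negative, a multinomial naive Bayes classifier is trained on P versus M, and the documents in M are then re-classified. The toy documents and the names `train_nb` and `predict` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_nb(docs_by_label):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    vocab = {w for docs in docs_by_label.values() for d in docs for w in d.split()}
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(w for d in docs for w in d.split())
        denom = sum(counts.values()) + len(vocab)
        log_prior = math.log(len(docs) / total_docs)
        log_likelihood = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Return the label with the highest posterior log-score for doc."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Words never seen in training are ignored.
        return log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(model, key=score)


# P: labeled positive documents (toy data, assumed for illustration).
P = ["football match goal", "goal keeper football", "match score goal"]

# M: mixed, unlabeled documents; some belong to P, some do not.
M = [
    "football goal score",
    "stock market price",
    "market price shares",
    "keeper match football",
    "shares stock price",
]

# Naive baseline (an assumption, not the project's method):
# treat all of M as negative, train, then re-classify M.
model = train_nb({"P": P, "not_P": M})
labels = [predict(model, d) for d in M]

for doc, label in zip(M, labels):
    print(f"{label:6s} {doc}")
```

On this toy data the baseline separates the two groups, but only because the hidden positives in M are few and the vocabularies barely overlap. Since M itself contains positive documents, the "negative" training set is noisy, and there is no labeled negative set against which to measure accuracy. These are the two gaps the abstract identifies.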
