ITR: Automatic Content Categorization and Annotation with Ambiguous Training Information

$300,000 | FY2003 | CSE | NSF

Brown University, Providence RI

Investigators

Abstract

Machine learning algorithms are crucial for efficiently automating key tasks in content and information management such as content annotation and categorization, taxonomy creation, content linking, information routing and filtering, robust search, and information extraction. One key factor that has limited the success of machine learning methods in this domain is a conceptual mismatch: the formal assumptions about the type and nature of the training information often differ from what is actually available in real-world applications. This project addresses this mismatch by developing innovative machine learning algorithms and architectures that can make use of more realistic types of training information. In particular, this includes the use of weakly labeled data, i.e., training data with ambiguous or incomplete annotations, and the systematic exploitation of dependencies between concepts and between concept annotations. The proposed research will lead to algorithms that have an increased range of applicability and that are more accurate and robust. The scope of the project encompasses well-known special cases such as multiple instance learning, label ambiguity, learning with concept taxonomies, learning with overlapping concepts, and label sequence learning. As a proof of concept, tools for categorizing medical documents and for supporting content-based image search will be developed.
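To make the notion of weakly labeled data concrete, the following is a minimal sketch of the standard multiple instance learning assumption, one of the special cases named above: a label is attached to a bag of instances rather than to each instance, and a bag counts as positive if at least one of its instances is positive. This is an illustration only and not the project's method; the names `bag_score`, `predict_bag`, and `linear_scorer`, as well as the weights and data, are hypothetical.

```python
from typing import Callable, Sequence

Instance = Sequence[float]
Bag = Sequence[Instance]


def bag_score(bag: Bag, instance_score: Callable[[Instance], float]) -> float:
    """Score a bag by its highest-scoring instance (standard MIL assumption)."""
    return max(instance_score(x) for x in bag)


def predict_bag(bag: Bag, instance_score: Callable[[Instance], float],
                threshold: float = 0.0) -> bool:
    """A bag is predicted positive if any instance scores above the threshold."""
    return bag_score(bag, instance_score) > threshold


def linear_scorer(weights: Sequence[float], bias: float) -> Callable[[Instance], float]:
    """Hypothetical instance-level scorer: a fixed linear function."""
    def score(x: Instance) -> float:
        return sum(w * v for w, v in zip(weights, x)) + bias
    return score


# Illustrative data: only the bag-level label is known during training,
# not which instance inside the bag is responsible for it.
scorer = linear_scorer([1.0, -1.0], -0.5)
positive_bag = [[0.0, 1.0], [2.0, 0.0]]   # second instance scores above 0
negative_bag = [[0.0, 1.0], [0.2, 0.3]]   # no instance scores above 0

print(predict_bag(positive_bag, scorer))
print(predict_bag(negative_bag, scorer))
```

In an image-search setting, for example, a bag could be an image and its instances the image's regions, with a keyword attached to the whole image rather than to a specific region. That mapping is an assumed illustration, not a detail taken from the award record.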
