CAREER: New Directions in Probabilistic Topic Models

$549,943 · FY2008 · CSE · NSF

Princeton University, Princeton NJ

Investigators

Abstract

There is a growing need for (semi-)automated tools to analyze and organize large collections of electronic information. In response, there is a surge of research on machine learning of probabilistic topic models, which automatically discover the hidden thematic structure in a large collection of documents. Once made explicit, this hidden structure facilitates browsing, searching, organizing, and summarizing vast amounts of information. This research program will build significantly on the current state of the art in topic modeling.

1. We will develop topic modeling algorithms that discover trends in document streams. Modeling evolutionary and revolutionary change of topics over time will be an important new capability for corpus analysts, providing methods of forecasting and understanding the changing patterns in serial collections such as news feeds, scientific publications, or blogs.

2. Many modern corpora, such as Wikipedia, contain important links between documents. We will develop topic models of such interconnected collections that explicitly represent and generalize inter-document and/or inter-topic relationships. Such relationships may be hyperlinks, scholarly citations, shared authorship, or statistical correlations. Capturing the patterns in these connections, and understanding their relationship to the texts, will have important implications for a great variety of scholarly, commercial, and personal 'recommender' systems.

3. Very often, analysts and other users approach a corpus with particular questions in mind. To facilitate focused, personalized exploration, we will develop supervised methods for discovering topic models that predict document-specific variables -- notably forms of relevance -- of online material such as scholarly papers, legal briefs, media sources, and product specifications.

This project addresses significant current limitations of topic modeling, and will provide practical new research and education tools for understanding and organizing modern repositories of information. We will make these tools available as open-source software to support and encourage their application to real-world problems, and we will fold the results of our research into ongoing education and outreach programs.
