CAREER: Robust and Secure Multi-Modal Learning for Library-Scale Text Collections

$566,000FY2017CSENSF

Cornell University, Ithaca NY

Investigators

Abstract

The growth of social media and digitized libraries has made computational text analysis a vital tool for modern scholarship. But too often methods that work on standardized collections for expert users don't translate to real-world data analysis. In order to be useful, text mining methodologies need to balance theoretical power with practical application. Real data sets are noisy and complicated. More importantly, vast amounts of data cannot be shared directly due to copyright, including all published books after 1923. This project will develop tools that can be applied to limited, privatized views of documents. Algorithms will focus on reliability and efficiency, so that powerful techniques can be used by non-expert users on easily accessible hardware, such as the 10 million K-12 students using low-powered browser-based Chromebooks thereby increasing the societal impact of the work. Unsupervised text mining methods such as topic models and word embeddings have become popular outside of machine learning because they operate on simple, widely-available representations and identify latent variables that represent recognizable themes, events, or concepts. But standard algorithms do not scale well, require full access to potentially sensitive text collections, and cannot take advantage of non-textual data such as images. Although recent work in spectral inference has produced improvements in speed, current methods are plagued by sensitivity to noisy observations. This work will develop a unified approach to unsupervised text mining based on matrix and tensor factorization. The project will focus on data rectification methods for input matrices, enabling simple algorithms to work dramatically better, even in the presence of sparse and noisy observations, while also reducing model uncertainty. The project will develop new methods for learning from private and sensitive documents by creating public views of non-public data. These will include both noisy representations of individual documents as well as corpus-level summary matrices, and support both strong non-identifiability and weaker non-expressivity criteria. Finally, the project will develop new tools for modeling images and text optimized for the way images actually accompany text in real corpora, rather than short, artificial captions. By jointly modeling large volumes of text and semantically related images, the project will enable users to search for contextually related images, not just visually similar images, and identify topics that are grounded in the visual world, not just in text. For further information see the project web page: http://mimno.infosci.cornell.edu

View original record on NSF Award Search →