RI: Small: Collaborative Research: Unsupervised Transcription of Early Modern Documents

$249,458 | FY2016 | CSE | NSF

Carnegie Mellon University, Pittsburgh PA

Abstract

Recently, researchers in the social sciences and humanities have made increasing use of digital technologies in their work, seeking to answer important questions about human artifacts based on new kinds of analyses. However, since many of their methods are statistical in nature, they require a large amount of digitally readable text to operate. For example, to ask statistical questions about how the legal rights of women have changed during the past five centuries, a large and unbiased sample of court proceedings spanning that time period has to be accessible in digital form. Unfortunately, for many time periods this data is not available, not because the historical documents have been lost, but because they cannot be efficiently transcribed. In particular, the 400 years just after the invention of the printing press (the early modern period, ca. 1450-1850) represent a critical dark period for such research because documents from this period are notoriously hard to transcribe into machine-readable text with automatic methods, for three reasons: they use obscure and unknown fonts, their text differs from modern language, and historical printing processes were imprecise. This proposal seeks to address these issues by treating transcription as a type of code-breaking and using machine learning to induce font and text structure directly from document images without relying on annotated examples, an approach called unsupervised learning. As a result, the proposal aims not just to digitize existing early modern corpora in major libraries, but also to produce a tool that researchers can use to digitize data at scale themselves and that is sufficiently flexible to develop new representations, for example, of non-standard character sets. The proposed approach treats the problem of document transcription as a linguistic decipherment problem, leveraging modeling techniques from work on decrypting historical ciphers.
The key idea is that while properties like font and text structure are document-specific and therefore difficult to treat generally with supervised techniques, these phenomena are in fact regular within individual documents. For example, while the shape of a particular character in an obscure historical font may be unknown to the system, that shape is in fact regular; every time the character is printed it uses the same template. Models that leverage this kind of regularity by incorporating it as an assumption can constrain the otherwise difficult unsupervised learning problem and make it feasible. This proposal introduces a class of generative models with this goal in mind, designed to learn fonts and predict accurate transcriptions in an unsupervised fashion by capturing the core properties of the process that generated the input data: the historical printing process. These models represent the specific types of printing and typesetting noise exhibited by early modern documents, treat typesetting as a latent variable, and jointly consider possible character segmentations and transcriptions during inference. Their parameters can be estimated efficiently, directly from images of historical documents without accompanying transcriptions. Further, by treating damaged portions of the input documents as latent variables, this proposal aims to automatically reconstruct damaged documents using the same approach. The unsupervised techniques developed here may have uses in other areas of natural language processing where annotated training data is hard to obtain; for example, in personalized speech recognition and grounded semantics.
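
The regularity assumption can be illustrated with a toy sketch. This is not the proposal's generative model: it swaps in a much simpler hard-assignment clustering (a k-means-style loop over binary images), and every name in it (`TRUE_A`, `TRUE_B`, `flip`, `learn_templates`) is hypothetical. It assumes glyphs are already segmented into 3x3 binary images and that noise flips exactly one pixel per printed glyph. Under those assumptions, the two hidden templates can be recovered from unlabeled glyphs alone, because every glyph of a character was printed from the same template.

```python
# Toy illustration only: hard-assignment clustering standing in for the
# proposal's generative font model. All names and data are hypothetical.

def hamming(a, b):
    """Number of pixels on which two binary glyphs disagree."""
    return sum(x != y for x, y in zip(a, b))

def flip(glyph, i):
    """Return a copy of glyph with pixel i inverted (simulated printing noise)."""
    return tuple(1 - p if j == i else p for j, p in enumerate(glyph))

def learn_templates(glyphs, k, iters=5):
    """Alternate between assigning glyphs to templates and re-estimating
    each template by a per-pixel majority vote over its assigned glyphs."""
    # Deterministic farthest-point initialisation from the data itself.
    templates = [glyphs[0]]
    while len(templates) < k:
        templates.append(
            max(glyphs, key=lambda g: min(hamming(g, t) for t in templates))
        )

    def nearest(g):
        return min(range(k), key=lambda j: hamming(g, templates[j]))

    for _ in range(iters):
        assign = [nearest(g) for g in glyphs]
        updated = []
        for j in range(k):
            members = [g for g, a in zip(glyphs, assign) if a == j]
            if not members:
                updated.append(templates[j])  # keep an empty cluster's template
                continue
            updated.append(
                tuple(int(2 * sum(col) > len(members)) for col in zip(*members))
            )
        templates = updated
    return templates, [nearest(g) for g in glyphs]

# Two hidden "font" templates, flattened 3x3 images (unknown to the learner).
TRUE_A = (1, 1, 1, 1, 0, 1, 1, 1, 1)  # ring shape
TRUE_B = (0, 1, 0, 0, 1, 0, 0, 1, 0)  # vertical bar

# Every printed glyph is its template with one pixel corrupted; no labels given.
glyphs = [flip(TRUE_A, i) for i in range(9)] + [flip(TRUE_B, i) for i in range(9)]

templates, assign = learn_templates(glyphs, k=2)
print(templates)  # learned templates
print(assign)     # cluster index for each glyph
```

No clean copy of either template appears in the input, yet the majority vote recovers both, since each pixel is corrupted in only one of the nine prints of a character. The models described above go well beyond this sketch: they also treat segmentation, typesetting, and the text itself as unknowns.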
