
Collaborative Research: RI: Small: Unsupervised Islamicate Manuscript Transcription via Lacunae Reconstruction

$297,845 · FY2022 · CSE · NSF

University of Maryland, College Park, College Park, MD

Abstract

This award tackles handwritten text recognition (HTR, the task of automatically transcribing images of handwritten manuscripts into symbolic text) for Islamicate manuscripts, a domain that encompasses Persian and Arabic written traditions originating in the premodern Islamic world (7th-19th centuries). HTR for modern text is itself a challenging problem that has received substantial attention from the fields of machine learning (ML) and artificial intelligence (AI). However, the predominance of modern text in HTR research is, to some extent, waning: current techniques are relatively robust on modern data, and contemporary written media production is already almost entirely digital. In contrast, historical manuscripts have received comparatively less attention from ML and AI, and at the same time represent both an exceptional opportunity for impact and a set of unique challenges for ML techniques. Specifically, the written traditions of the Islamicate world together form one of the largest -- if not the largest -- archives of human cultural production of the premodern world. Scanning and digitization efforts over the last decade have made images of Islamicate manuscripts in a large number of collections available to the public. However, this data remains 'locked' for most scholarly uses because it has not been transcribed into symbolic text, which is required for many types of analysis. In fact, the script styles used in Islamicate manuscripts -- 'scribal hands' -- vary so widely and differ so substantially from modern forms that even manual close reading of these texts requires expert training and is thus limited to a small subset of researchers. The primary outcome of this project will be new techniques that 'unlock' the Islamicate written tradition by accurately transcribing it. As a result, this project has the potential to be transformative for humanities disciplines such as Islamic and Near Eastern Studies by enabling libraries to accurately transcribe entire collections and, further, by allowing individual researchers to accurately transcribe manuscripts outside the western canon. Finally, this research will also support interdisciplinary training of a diverse set of graduate students at the University of California San Diego and the University of Maryland.

Current techniques for HTR require large amounts of in-domain supervised training data in order to produce highly accurate transcriptions. The neural architectures behind these modern methods are capable of generalizing, to some degree, across modern handwriting styles when trained on larger and more diverse collections of transcribed data. However, their limitations make these techniques impractical for large-scale transcription of Islamicate texts for two reasons: (1) scribal hand variation across Islamicate manuscripts is much more pronounced than stylistic variation in modern handwriting; and (2) transcriptions of Islamicate manuscripts that can be used as supervised training data are extremely scarce because accurate manual transcription requires expert training. This project will develop a new unsupervised learning framework for Islamicate HTR centered around a novel pretraining task: lacuna reconstruction. The new approach trains a neural encoder for images of manuscript text lines by learning to reconstruct masked regions -- i.e., lacunae -- of unlabeled manuscript images. This completely unsupervised training criterion implicitly incentivizes the model to discover and encode discrete

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.