Collaborative Research: Estimating Articulatory Constriction Place and Timing from Speech Acoustics

$154,000 | FY2022 | SBE | NSF

Haskins Laboratories, Inc., New Haven CT

Investigators

Abstract

This collaborative project focuses on a new approach for using speech recordings to study speaker pronunciation habits--that is, the way speakers systematically coordinate the articulatory movements of their lips, jaw, tongue, glottis and soft palate to produce words and sentences. These articulatory habits differ between individuals, and across languages and dialects of the same language, accounting for many aspects of foreign accent, speech disorders and speaking style. Whereas previous studies of these habits have required specialized equipment for the direct observation of articulator movements, the aim of this project is to develop and improve a tool for "speech inversion"--that is, a tool that can accurately recover articulatory movements directly from the acoustic speech signal using machine learning methods. To date, the tool developed by the project team has successfully recovered movements of the tongue and lips; the current project extends the tool’s functionality to encompass nasality (soft palate) and voicing (glottis). Training and validation of the extended system will proceed using a newly collected corpus of acoustic and articulatory data drawn from speakers of American English. This corpus, comprising co-collected audio, nasal, voicing, and articulatory movement data, will serve as 'ground truth' for training and assessing the capabilities of the fully trained speech inversion system. As a further test, we will evaluate the system against ground-truth data from speakers of languages whose articulatory habits are known to differ from those of English.

The goal of this project is to develop and refine a Speech Inversion Tool that 'reads' acoustic recordings of speech and 'recovers' details of the magnitude and timing of articulatory movements. The project aims to accomplish this goal by training specialized neural network models to relate features of the acoustic signal to separately acquired ground-truth nasal vs. oral outflow signals and concurrent electroglottography. Training data derive from native speakers of English; validation and tests for generalization include productions of speakers of Canadian French and Russian. When successfully validated, the resulting speech inversion tool will be useful for identifying medical issues that affect speech movement organization, such as the well-known disruption of oral/laryngeal timing in speakers with dysarthria. In addition, incorporating estimates of articulation may aid in the tracking of changes resulting from medical conditions such as depression and schizophrenia. More generally, the ability to rapidly and easily analyze articulatory movements obtained from audio recordings alone has the potential to substantially improve Automated Speech Recognition (ASR) systems, to assist scholars, forensic scientists, and clinical professionals studying the speech of communities under field conditions in rural or under-resourced areas, and to help in the documentation of endangered languages.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
