RI: Small: Multi-View Learning of Acoustic Features for Speech Recognition Using Articulatory Measurements

$444,859 · FY2013 · CSE · NSF

Toyota Technological Institute At Chicago, Chicago IL

Investigators

Abstract

This project explores techniques for learning acoustic features for speech recognition, based on multi-view learning using acoustic and articulatory recordings. Recent work has shown recognition improvements using this strategy via linear and nonlinear canonical correlation analysis, in which transformations of acoustic features are learned so as to maximize correlation with (transformations of) articulatory measurements. Prior work has been limited to a single database and a single language. The main goals of this project are to learn better universal features for arbitrary speakers and languages and to develop improved multi-view techniques. Project activities include: learning time-varying projections; multi-view techniques based on neural networks; "many-view" learning using articulation, video, labels, etc.; efficient implementations; new input features such as spectro-temporal filters; and visualization tools for related research and education.

A critical component of automatic speech recognition is a representation of the audio signal that encapsulates useful information while discarding acoustic noise, speaker identity, and so on. This project aims to automatically learn improved representations using statistical analysis of audio recordings paired with positions of the speech articulators (lips, tongue, etc.) and other measurements. The project starts with basic statistical techniques, and develops new techniques that address challenges and opportunities specific to speech and related signals. The project's impact extends beyond speech processing. Applications of multi-view representation learning include neurology, meteorology, chemometrics, computer vision, and text processing; all of these can benefit from the improved techniques. The work impacts education by generating materials for a Speech Technologies course and visualization tools for speech and other signals.
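For readers unfamiliar with the linear canonical correlation analysis mentioned in the abstract, the following is a minimal illustrative sketch, not the project's implementation. It assumes two paired views stored as NumPy arrays (for example, acoustic frames and articulatory measurements, one row per frame) and finds linear projections of each view that are maximally correlated. The function name `linear_cca`, the regularization parameter `reg`, and the array layout are all assumptions made for illustration.

```python
import numpy as np


def linear_cca(X, Y, k, reg=1e-4):
    """Illustrative regularized linear CCA (hypothetical helper).

    X: (n, dx) array, e.g. acoustic features, one row per frame.
    Y: (n, dy) array, e.g. articulatory measurements for the same frames.
    k: number of canonical directions to keep.
    reg: small ridge term added to each covariance (an assumption here,
         used only to keep the matrices invertible).

    Returns (A, B, corr): projection matrices of shape (dx, k) and (dy, k),
    and the top-k singular values, which approximate the canonical
    correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Regularized within-view covariances and the cross-view covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return (V * w ** -0.5) @ V.T

    Sxx_is = inv_sqrt(Sxx)
    Syy_is = inv_sqrt(Syy)

    # Singular vectors of the whitened cross-covariance give the
    # canonical directions in the whitened spaces.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)

    A = Sxx_is @ U[:, :k]
    B = Syy_is @ Vt.T[:, :k]
    return A, B, s[:k]
```

In the setting the abstract describes, only the acoustic-side projection (`A` in this sketch) would be applied at recognition time, since articulatory measurements are not available for arbitrary test speech; the articulatory view serves only to guide feature learning.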
