CAREER: Integrating perceptual models of auditory importance into deep learning-based noise-robust speech recognition
CUNY Brooklyn College, Brooklyn, NY
Investigators
Abstract
Hearing is central to human interaction, but the hearing process is not easily observed. The objective of this project is to train models to identify the portions of speech utterances that are important to their being correctly identified by human listeners, and to use predictions from these models to make automatic speech recognition (ASR) systems more noise robust by focusing on those regions. The ability to identify important regions of an utterance could significantly advance our understanding of healthy and impaired hearing. Improvements in automatic speech recognition would have broader impacts on the 260 million Americans who use smartphones and the $100 billion ASR industry. The educational portion of this project utilizes examples from speech, language, audio, and music processing to attract and retain students in Brooklyn College's introductory programming course, which serves a diverse student body, along with similar efforts at affiliated high school programs.
By measuring the intelligibility of a given utterance in many different noisy mixtures, the team's preliminary results have shown that some regions of an utterance are more important or useful than others in identifying it. This project expands upon these preliminary results in three ways. First, it measures ASR auditory importance using the team's existing slow but accurate technique involving random "bubble noise", comparing different ASR variants to each other and to human listeners. Second, it trains a model to predict ASR auditory importance from clean speech using a novel architecture called the bubble cooperative network (BCN) that allows the recognizer to be trained jointly with the BCN to improve performance. Third, it adapts the learned importance predictor to human listeners and uses this human-adapted importance predictor to further refine the ASR models.
These tasks should permit the use of utterance-level human responses to directly improve the noise robustness of automatic speech recognition. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
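The idea of measuring importance from intelligibility across many noisy mixtures can be illustrated with a small simulation. The sketch below is not the team's implementation: the grid size, the circular "bubble" shape, the hypothetical `toy_listener` (a stand-in for a human or ASR response), and the correlation-based `importance_map` are all assumptions chosen for illustration. It generates random masks that leave parts of a time-frequency grid audible, records whether the simulated listener is correct for each mask, and correlates audibility at each point with correctness.

```python
import numpy as np

def make_bubble_mask(shape, n_bubbles, radius, rng):
    """Return a 0/1 mask; 1 marks time-frequency points left audible."""
    n_freq, n_time = shape
    f, t = np.mgrid[0:n_freq, 0:n_time]
    mask = np.zeros(shape, dtype=float)
    for _ in range(n_bubbles):
        cf = rng.integers(0, n_freq)  # bubble centre (frequency bin)
        ct = rng.integers(0, n_time)  # bubble centre (time frame)
        mask[(f - cf) ** 2 + (t - ct) ** 2 <= radius ** 2] = 1.0
    return mask

def toy_listener(mask, important):
    """Hypothetical listener: correct when at least half of the
    (hidden) important region is audible through the mask."""
    return mask[important].mean() >= 0.5

def importance_map(masks, correct):
    """Correlate audibility at each point with correctness across mixtures."""
    masks = np.asarray(masks, dtype=float)      # (n_trials, n_freq, n_time)
    correct = np.asarray(correct, dtype=float)  # (n_trials,)
    m = masks - masks.mean(axis=0)
    c = correct - correct.mean()
    cov = np.tensordot(c, m, axes=(0, 0)) / len(correct)
    denom = masks.std(axis=0) * correct.std()
    # Points with no variance get importance 0 instead of a division error.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

rng = np.random.default_rng(0)
shape = (16, 32)  # assumed (frequency bins, time frames)
important = np.zeros(shape, dtype=bool)
important[4:8, 10:16] = True  # assumed ground-truth important region

masks = [make_bubble_mask(shape, n_bubbles=6, radius=3, rng=rng)
         for _ in range(2000)]
correct = [toy_listener(m, important) for m in masks]
imp = importance_map(masks, correct)
```

In this toy setting, the recovered map `imp` should be larger inside the assumed important region than outside it. With real listeners or recognizers, the responses would come from experiments rather than from a known region, which is why many mixtures are needed per utterance.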