HCC: Small: Modeling Acoustic and Articulatory Features for Hybrid Synthesis

$212,902 · FY2011 · CSE · NSF

Northeastern University, Boston MA

Investigators

Abstract

In recent decades, synthetic speech has become a ubiquitous and increasingly seamless aspect of human-machine interfaces. Although cars, microwaves, phones, and kiosks all "talk" in human-like ways, the naturalness and personality of these voices fall short of human expression. While this may not matter for many text-to-speech (TTS) applications, over two million Americans with severe speech-motor impairments require assistive communication aids with TTS output. Concatenative TTS synthesizers yield highly intelligible voices, yet many assistive devices rely on small-footprint formant synthesis that sounds robotic and has poor intelligibility. Moreover, the choice of voices on conventional devices is limited and does not reflect the user; it is not uncommon for a child to use the same voice her whole life and for her peers to share that same voice even when using different devices. This lack of attention to the individuality of synthetic voices has consequences for the adoption of assistive technology as an extension of the user, and may adversely impact societal attitudes toward the user group. In her prior work, the PI began to address these issues by adapting a concatenative synthesizer constructed from acoustic recordings of a healthy talker using vocal source characteristics obtained from a target talker with speech impairment. The adapted voice was highly intelligible and conveyed the target user's identity, yet it also retained substantial elements of the healthy talker's identity due to the influence of vocal tract filter characteristics. This suggests that personalized speech synthesis may be more successful with an alternative approach, in which acoustic and articulatory data from healthy talkers are combined with both source and filter characteristics from target talkers to generate an individualized voice.
In this project, the PI will develop hybrid statistical parametric synthesis techniques to model vocal tract and source characteristics of impaired talkers, with the goal of generating highly intelligible and personalized synthetic speech. The PI envisages a future where source and filter parameters of a Hidden Markov Model (HMM) based synthesizer can be adapted to model a child user's vocal tract and modified over time to "grow" with his maturing vocal system, fostering a stronger personal connection between the user and the communication device.
Broader Impacts: This project strives to make communication accessible and socially fulfilling by designing an enabling technology that blurs the line between system and user. The human voice is not merely a signal; it has an individualized and personal quality that impacts how others perceive us and how we interact with those around us. The ultimate goal of this work is to afford users of TTS the same ownership and individuality as the natural voice. Project outcomes will have broad impact both on users of assistive aids and able-bodied users of TTS technologies. The research may also lead to a novel and innovative means of assessing the nature and articulatory locus of speech impairment, by comparing model parameters to impaired productions. The interdisciplinary nature of this research will promote teaching, training and learning in computer science and in speech and hearing sciences.
