SBIR Phase I: Speech Synthesis System Based on Articulation Modeling

$97,000 · FY2006 · TIP · NSF

Red Weather Technologies, Cary NC

Investigators

Abstract

This Small Business Innovation Research (SBIR) Phase I research project explores the technical merits of developing a speech synthesis system based on modeling the human articulation mechanism. Existing automatic speech generation technology, largely based on piecing together components of pre-recorded voice, is optimized for efficiency rather than sound quality. Even the best synthetic voices, though intelligible, sound artificial and jarring. Quality issues, combined with a lack of variety and prosody control, have prevented the wider deployment of synthetic speech in computers, telephony interfaces, and assistive devices, as well as expansion into new markets such as video games and computer animation.

This project seeks to transform two vital aspects of speech synthesis: phoneme-to-audio conversion and voice customization. Phoneme-to-audio conversion is improved by simulating the human vocal tract and thereby emulating the human sound production mechanism. A basic model is constructed from MRI images and adjusted for each sound using 3D animation techniques. Airflow within the vocal tract is computed using a fluid dynamics tool. This airflow pattern is then converted into a sound waveform using specialized signal processing. Speech synthesized in this manner is expected to sound more natural. This approach also allows parametric voice customization: by adjusting the physical model of the simulated vocal tract, the voice can be made flat or sonorous, shrill or deep, hoarse or mellifluous, which provides a variety of voices.

The proposed technology has a broad set of commercial applications, categorized as follows: (1) Human-computer interaction is the chief market for speech synthesis products. The largest application in this category is automated telephony services. Other applications include spoken interfaces for personal computers, digital assistants, and automotive navigation. All these applications require natural-sounding voices to improve the customer experience. (2) Speech synthesis is used as an adaptive technology for people with speech, hearing, or visual impairments. Both voice quality and customization are key to broadening the range and usefulness of assistive products. (3) Content creation is a market newly addressable with high-quality speech synthesis. Video games and computer animations typically need a large variety of voices. Hiring voice talent is prohibitively expensive for smaller companies and artists. The proposed technology provides a convenient tool for incorporating natural-sounding voices into games, animations, and educational material at a reasonable cost.

Beyond commercial applications, this technology is a valuable tool for several current research areas, including speech disorders, linguistics, and speech physiology. Finally, it is a small piece of what is required to create a total simulation of the human body, a goal rigorously pursued in both academia and public health institutions.
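The idea of parametric voice customization described in the abstract can be illustrated with a far simpler model than the one the project proposes. The sketch below is not the project's MRI-based, fluid-dynamics method; it assumes the textbook approximation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The function name `tube_formants`, the assumed speed of sound, and the two tract lengths are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: a uniform-tube approximation of the vocal tract,
# NOT the MRI-based fluid dynamics simulation described in the abstract.

SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air in the vocal tract


def tube_formants(tract_length_m, count=3, speed=SPEED_OF_SOUND_M_S):
    """Return the first `count` resonance frequencies (Hz) of a uniform tube
    closed at one end (glottis) and open at the other (lips).

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    """
    if tract_length_m <= 0:
        raise ValueError("tract length must be positive")
    return [(2 * n - 1) * speed / (4.0 * tract_length_m)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical tract lengths: a longer tract gives lower resonances
    # (a "deeper" voice), a shorter tract gives higher ones ("shriller").
    for label, length_m in [("longer tract, 0.175 m", 0.175),
                            ("shorter tract, 0.140 m", 0.140)]:
        formants = tube_formants(length_m)
        print(label, [round(f) for f in formants])
```

In this toy model a single physical parameter, the tract length, shifts every resonance at once, which is a small-scale version of the abstract's claim that adjusting the physical model changes the character of the voice. A realistic simulation would vary the cross-sectional shape along the tract as well, which is what the MRI-derived geometry is meant to capture.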
