TRANSFORM: flexible voice synthesis through articulatory voice transformation

$327,142 · FY2005 · CSE · NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

People have long wanted machines to talk to them, but most have strong preferences for particular voices. Current techniques in speech synthesis can build voices that sound very close to the original speaker, capturing the style, manner, and articulation of the source voice. However, such systems require many hours of carefully recorded speech and expert tuning to reach an acceptable level of quality. An exciting new alternative method for building synthetic voices is voice transformation. This method uses an existing recorded database and converts it to a target voice using as few as 10-20 sentences. This technique offers the potential to make speech synthesizers talk in whatever voice is desired, with significantly less effort than previous techniques. This project offers a new direction in voice transformation. Current transformation techniques concentrate on a spectral mapping of the voice, i.e., converting the properties of the speech signal. Instead, we use the underlying positions of the vocal tract articulators (i.e., the positions of the teeth, tongue, lips, and velum), which give rise to the spectral output of the voice. Using new statistical modeling techniques, we can successfully predict the positions of a speaker's articulators from the speech signal. Then, in the virtual vocal tract domain, we map between speakers and regenerate the speech for the target voice. This work enables the easy construction of new synthetic voices, allowing personalization of speech output. It increases our knowledge of the speech generation process and characterizes what makes a voice personal.
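As a rough illustration of the speaker-to-speaker mapping step described in the abstract, the sketch below fits a simple per-dimension linear map between time-aligned frames of articulator positions from a source and a target speaker. This is only a toy stand-in: the abstract does not specify the statistical models used, and ordinary least squares is swapped in here purely for clarity. The names `fit_per_dimension_map` and `apply_map`, the frame representation, and the assumption that frames are already time-aligned are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one frame is a list of articulator
# coordinates (e.g. tongue, lips, velum) at a single time step.
Frame = Sequence[float]


def fit_per_dimension_map(
    source: List[Frame], target: List[Frame]
) -> List[Tuple[float, float]]:
    """Fit target_d ~ a_d * source_d + b_d for each dimension d.

    Assumes source and target frames are already time-aligned pairs.
    Uses ordinary least squares, not the project's actual models.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need at least two time-aligned frame pairs")
    n = len(source)
    params = []
    for d in range(len(source[0])):
        xs = [frame[d] for frame in source]
        ys = [frame[d] for frame in target]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Fall back to a pure offset if the source dimension never varies.
        a = cov_xy / var_x if var_x > 0 else 1.0
        b = mean_y - a * mean_x
        params.append((a, b))
    return params


def apply_map(
    params: List[Tuple[float, float]], frames: List[Frame]
) -> List[List[float]]:
    """Map source-speaker frames into the target speaker's space."""
    return [[a * x + b for (a, b), x in zip(params, frame)] for frame in frames]


if __name__ == "__main__":
    # Made-up two-dimensional frames, for illustration only.
    src = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
    tgt = [[1.0, -0.5], [3.0, 0.0], [5.0, 0.5]]
    mapping = fit_per_dimension_map(src, tgt)
    print(apply_map(mapping, [[3.0, 4.0]]))
```

In a full system, the mapped articulator trajectories would still need to be converted back into a speech signal for the target voice; that regeneration step is outside the scope of this sketch.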
