SBIR Phase I: VocaliD - Infusing Unique Vocal Identities into Synthesized Speech

$150,000 | FY2015 | TIP | NSF

Vocalid Inc, Belmont MA

Abstract

The broader impact/commercial potential of this Small Business Innovation Research (SBIR) Phase I project is to create custom-crafted voices for text-to-speech applications that empower recipients to engage in conversation and be heard in their own voices. The company's technology blends a recipient's residual vocal abilities with a matched donor's speech database to craft a personalized voice that combines the recipient's vocal identity with the clarity of the donor's speech. In the United States alone, there are over 2.5 million individuals with speech impairment and 3-5 million individuals with low vision who rely on a limited set of generic, mechanical-sounding voices for assisted communication. It is not uncommon for several children in a classroom or adults in a workplace to use the same synthetic voice. Each of us has a unique voiceprint that conveys our age, gender, race, size, and personality. Until now, this variety and flexibility of voice has not been afforded to users of speech synthesis technology. The company's goal is to give the gift of voice to all those who need and want it to enhance how they learn, work and play.

This Small Business Innovation Research (SBIR) Phase I project aims to engineer personalized synthetic voices that convey the recipient's unique vocal identity. The company's innovation is grounded in the source-filter theory of speech production, which divides the speech signal into a source component (the vocal folds) and a filter component (the rest of the vocal tract) that are largely independent. Because source and filter characteristics both contribute to speaker identity, the key challenge is to create an authentic yet understandable voice by extracting as much identity information from recipient vocalizations as possible and combining it with speech clarity information from the donor. Standard voice conversion methods require large amounts of spoken data from donors and recipients as well as parallel corpora, which are not available for the target applications. This Phase I work will make significant advances toward the design and implementation of a novel automated voice matching and transformation process that leverages a large database of healthy donors' speech to generate personalized synthetic voices from sparse samples of target recipients' vocalizations. Algorithms will be validated using both quantitative and perceptual metrics to assess intelligibility, similarity and naturalness.
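
The source-filter idea described in the abstract can be illustrated with a toy sketch. This is not the company's method, which the abstract does not specify. It only shows the general principle of pairing a source at one speaker's pitch with a filter shaped by another speaker's vocal-tract resonances. Every function name, the formant and bandwidth values, and the use of an impulse train as the source are assumptions chosen for illustration.

```python
import numpy as np


def glottal_source(f0_hz, duration_s, sample_rate):
    """Impulse train at pitch f0_hz: a crude stand-in for vocal-fold pulses."""
    n = int(duration_s * sample_rate)
    source = np.zeros(n)
    period = int(round(sample_rate / f0_hz))  # samples between pulses
    source[::period] = 1.0
    return source


def formant_filter_coeffs(formants_hz, bandwidths_hz, sample_rate):
    """Denominator A(z) of an all-pole filter, one resonance per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / sample_rate)   # pole radius (< 1, so stable)
        theta = 2.0 * np.pi * f / sample_rate   # pole angle
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


def all_pole_filter(a, x):
    """Apply y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = np.zeros(len(x))
    order = len(a) - 1
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(order, n) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y


def blend_voice(recipient_f0_hz, donor_formants_hz, donor_bandwidths_hz,
                duration_s=0.25, sample_rate=16000):
    """Hypothetical blend: source at the recipient's pitch, donor's filter."""
    source = glottal_source(recipient_f0_hz, duration_s, sample_rate)
    a = formant_filter_coeffs(donor_formants_hz, donor_bandwidths_hz,
                              sample_rate)
    return all_pole_filter(a, source)


# Illustrative values only: a 200 Hz pitch and rough /a/-like formants.
vowel = blend_voice(200.0, [700.0, 1200.0, 2600.0], [80.0, 90.0, 120.0])
```

Changing `recipient_f0_hz` alters the pitch period of `vowel` while the resonances stay fixed, and changing the formant list does the reverse. That separation is what the source-filter model relies on. A practical system would have to estimate both components from recorded speech, which this sketch does not attempt.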
