RI: Medium: Collaborative Research: Variance and Invariance in Voice Quality: Implications for Machine and Human Speaker Identification

$867,552 | FY2017 | CSE | NSF

University Of California-Los Angeles, Los Angeles CA

Investigators

Abeer Alwan (contact), Jody Kreiman, Patricia A Keating

Abstract

A talker's voice quality conveys many kinds of information, including word and utterance prosody, emotional state, and personal identity. Variations in both the voice source and the vocal tract affect voice quality, and there can be significant inter- and intra-talker variability. Understanding which aspects of a voice are talker-specific should aid in understanding the human limits in perceiving speaker differences and in developing better speaker identification (SID) algorithms. Despite technological advances, the performance of current SID systems remains far from perfect. It degrades significantly when the training and testing conditions are mismatched, especially in speech style (conversational versus read, for example) or in the speaker's emotional state, when the utterances are short, and when the task is text-independent. The key questions that the project aims to answer are: Under normal daily-life variability, how often does a talker sound less like him- or herself and more like someone else? Which acoustic properties account for speaker similarity? Can automatic SID algorithms be improved by knowledge of which properties are important for human perception of speaker similarity? The project is transformative and helps to better understand and model variance and invariance in voice quality. It will inform several important issues in human speech perception, especially in the area of talker similarity. Understanding which aspects of the source signal, if any, are talker-specific should aid in developing better speaker identification and verification algorithms that can handle short utterances and are robust to varying affect and styles of speaking. A model of voice quality variations could also improve the naturalness of text-to-speech (TTS) systems.
If it were known how much a person could change his or her voice quality without compromising their vocal identity, this knowledge could also inform medical rehabilitation applications and forensics. A better understanding of voice quality will thus have significant impact scientifically and for engineering, forensic, and medical applications. The project has strong outreach and dissemination programs and fosters interdisciplinary activities in Electrical Engineering, Linguistics, and Speech and Hearing Science at UCLA and the Center of Excellence at JHU. It trains undergraduate and graduate students in important cross-disciplinary activities of technological and scientific significance. The results will be published in high-quality journals and presented at relevant international conferences. The research results (a set of databases, software tools, and publications) will be disseminated freely.
The project analyzes how the speech signal varies within and across talkers under circumstances that introduce variability in everyday life. Specifically, it investigates whether an individual talker's speech varies significantly across recording sessions and speech tasks. Most importantly, it examines how intra-talker variability from all these sources compares with inter-talker variability. Addressing these issues requires a high-quality speech database with multiple voice samples from many talkers (in this case 200), which are collected, annotated, and distributed to other researchers. Acoustic analyses reveal inter- and intra-talker variability in the speech signal across different situations by generating a multidimensional acoustic profile of each talker that specifies the range of parameter values that are typical in the corpus for that talker, and the likelihood of deviations from that usual profile.
Perceptual studies determine the extent to which parameter profiles predict perceived similarity, and how much variability in each parameter can be tolerated before talkers cease to sound like themselves. Insights from the acoustic and perceptual studies guide the development of text-dependent and text-independent SID algorithms that are anticipated to be robust to variations in affect and style and to short utterances.
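
As a rough illustration of the per-talker acoustic profile and the intra- versus inter-talker comparison described in the abstract, the following minimal Python sketch may help. It is not the project's method: the parameter names (f0_hz, h1_h2_db), the toy values, the function names, and the use of a simple z-score as a stand-in for "likelihood of deviation" are all assumptions made for this example.

```python
from statistics import mean, pstdev, pvariance

# Toy corpus: talker -> list of voice samples, each a dict of acoustic parameters.
# Parameter names and all values are invented for illustration only.
corpus = {
    "talker_a": [
        {"f0_hz": 110.0, "h1_h2_db": 2.0},
        {"f0_hz": 120.0, "h1_h2_db": 4.0},
        {"f0_hz": 130.0, "h1_h2_db": 3.0},
    ],
    "talker_b": [
        {"f0_hz": 200.0, "h1_h2_db": 6.0},
        {"f0_hz": 210.0, "h1_h2_db": 8.0},
        {"f0_hz": 220.0, "h1_h2_db": 7.0},
    ],
}


def talker_profile(samples):
    """Summarize each parameter's typical range for one talker."""
    profile = {}
    for param in samples[0]:
        values = [s[param] for s in samples]
        profile[param] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "sd": pstdev(values),  # population standard deviation
        }
    return profile


def deviation_scores(profile, sample):
    """Z-score of a new sample against the talker's profile (assumed proxy
    for how unusual the sample is for that talker)."""
    scores = {}
    for param, stats in profile.items():
        sd = stats["sd"]
        scores[param] = (sample[param] - stats["mean"]) / sd if sd > 0 else 0.0
    return scores


def intra_to_inter_ratio(corpus, param):
    """Mean within-talker variance divided by the variance of talker means."""
    per_talker = [[s[param] for s in samples] for samples in corpus.values()]
    intra = mean([pvariance(values) for values in per_talker])
    inter = pvariance([mean(values) for values in per_talker])
    return intra / inter


profile_a = talker_profile(corpus["talker_a"])
print(profile_a["f0_hz"])
print(deviation_scores(profile_a, {"f0_hz": 150.0, "h1_h2_db": 3.0}))
print(intra_to_inter_ratio(corpus, "f0_hz"))
```

A ratio well below 1 for a parameter would suggest that talkers differ from one another more than each talker varies from sample to sample, which is the kind of property one would look for in a talker-specific cue. A real analysis would need many more samples per talker and a better-motivated deviation model than a z-score.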
