A corpus-based speech synthesis system with emotion

Akemi Ishii, Nick Campbell, Fumito Higuchi, Michiaki Yasumura

Research output: Contribution to journal › Article

110 Citations (Scopus)

Abstract

We propose a new approach to synthesizing emotional speech by a corpus-based concatenative speech synthesis system (ATR CHATR) using speech corpora of emotional speech. In this study, neither emotion-dependent prosody prediction nor signal processing per se is performed for emotional speech. Instead, a large speech corpus is created per emotion to synthesize speech with the appropriate emotion by simple switching between the emotional corpora. This is made possible by the normalization procedure incorporated in CHATR that transforms its standard predicted prosody range according to the source database in use. We evaluate our approach by creating three kinds of emotional speech corpora (anger, joy, and sadness) from recordings of a male and a female speaker of Japanese. The acoustic characteristics of each corpus are different and the emotions identifiable. The acoustic characteristics of each emotional utterance synthesized by our method show clear correlations with those of each corpus. Perceptual experiments using synthesized speech confirmed that our method can synthesize recognizably emotional speech. We further evaluated the method's intelligibility and the overall impression it gives to listeners. The results show that the proposed method can synthesize speech with high intelligibility and gives a favorable impression. With these encouraging results, we have developed a workable text-to-speech system with emotion to support the immediate needs of nonspeaking individuals. This paper describes the proposed method, the design and acoustic characteristics of the corpora, and the results of the perceptual evaluations.
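The abstract mentions a normalization procedure that maps CHATR's standard predicted prosody range into the range of the emotional corpus currently selected, but does not detail it. A minimal sketch, assuming a simple mean/variance (z-score) mapping of predicted F0 values; the function name and the corpus statistics below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of prosody-range normalization. Assumption: the
# standard predicted prosody (here F0 in Hz) is z-score-mapped into the
# mean and spread of the selected emotional source corpus, so switching
# corpora rescales the prediction without emotion-specific prosody models.

def normalize_prosody(predicted_f0, standard_stats, corpus_stats):
    """Map predicted F0 values from the standard range into the
    range of the selected emotional corpus (mean/std mapping)."""
    std_mean, std_dev = standard_stats
    corp_mean, corp_dev = corpus_stats
    return [corp_mean + (f0 - std_mean) / std_dev * corp_dev
            for f0 in predicted_f0]

# Example: rescale a neutral prediction into an "anger" corpus range.
neutral = (120.0, 15.0)   # assumed standard mean/std of F0 in Hz
anger = (160.0, 30.0)     # assumed mean/std of the anger corpus
print(normalize_prosody([120.0, 135.0, 105.0], neutral, anger))
# → [160.0, 190.0, 130.0]
```

Under this assumption, selecting a different emotional corpus only changes the target statistics, which matches the abstract's claim that emotion is controlled by simple switching between corpora.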

Original language: English
Pages (from-to): 161-187
Number of pages: 27
Journal: Speech Communication
ISSN: 0167-6393
Publisher: Elsevier
Volume: 40
Issue number: 1-2
DOI: 10.1016/S0167-6393(02)00081-X
Publication status: Published - 1 Apr 2003
Externally published: Yes

Keywords

  • Concatenative speech synthesis
  • Corpus
  • Emotion
  • Natural speech
  • Source database

ASJC Scopus subject areas

  • Software
  • Modelling and Simulation
  • Communication
  • Language and Linguistics
  • Linguistics and Language
  • Computer Vision and Pattern Recognition
  • Computer Science Applications

Cite this

A corpus-based speech synthesis system with emotion. / Ishii, Akemi; Campbell, Nick; Higuchi, Fumito; Yasumura, Michiaki.

In: Speech Communication, Vol. 40, No. 1-2, 01.04.2003, p. 161-187.


Ishii, A, Campbell, N, Higuchi, F & Yasumura, M 2003, 'A corpus-based speech synthesis system with emotion', Speech Communication, vol. 40, no. 1-2, pp. 161-187. https://doi.org/10.1016/S0167-6393(02)00081-X
