Emotional Speech Synthesis Based on Emotion-Timbre Disentanglement

[Paper] [Code]


Jianing Yang¹, Sheng Li², Takahiro Shinozaki², Yuki Saito¹, Hiroshi Saruwatari¹

¹ The University of Tokyo, Japan

² Institute of Science Tokyo, Japan

baleyang@g.ecc.u-tokyo.ac.jp

Abstract: Current emotional text-to-speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but they do not capture the nuanced acoustic details of the reference speech. To address this, we propose a novel emotional TTS method that predicts fine-grained, phoneme-level emotion embeddings while disentangling the intrinsic attributes of the reference speech. The proposed method employs a style disentanglement strategy that guides two feature extractors to reduce the mutual information between timbre and emotion features, effectively separating these style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled, fine-grained representations for advancing the quality and flexibility of emotional TTS systems.
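To make the disentanglement objective concrete, here is a minimal sketch, not the authors' implementation: it assumes a CLUB-style variational upper bound on mutual information as the auxiliary loss between the outputs of two toy reference encoders. All module names, shapes, and hyperparameters are illustrative.

```python
# Sketch (assumption, not the paper's code): two feature extractors produce
# emotion and timbre embeddings from a reference mel-spectrogram, and an
# auxiliary CLUB-style estimator penalizes the mutual information between them.
import torch
import torch.nn as nn


class RefEncoder(nn.Module):
    """Toy reference encoder: mean-pools a mel-spectrogram into one embedding."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        return self.net(mel).mean(dim=1)         # (batch, dim)


class CLUBEstimator(nn.Module):
    """Variational network q(timbre | emotion) used to upper-bound MI."""
    def __init__(self, dim=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))
        self.logvar = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))

    def log_likelihood(self, emo, tim):
        # Gaussian log-likelihood of paired (emotion, timbre) embeddings.
        mu, logvar = self.mu(emo), self.logvar(emo)
        return (-0.5 * (tim - mu) ** 2 / logvar.exp() - 0.5 * logvar).sum(-1).mean()

    def mi_upper_bound(self, emo, tim):
        # CLUB bound: paired log-density minus average over shuffled pairs.
        mu, logvar = self.mu(emo), self.logvar(emo)
        positive = (-0.5 * (tim - mu) ** 2 / logvar.exp()).sum(-1)
        negative = (-0.5 * (tim.unsqueeze(0) - mu.unsqueeze(1)) ** 2
                    / logvar.exp().unsqueeze(1)).sum(-1).mean(dim=1)
        return (positive - negative).mean()


emotion_enc, timbre_enc, club = RefEncoder(), RefEncoder(), CLUBEstimator()
mel = torch.randn(4, 200, 80)                    # dummy batch of reference mels
emo, tim = emotion_enc(mel), timbre_enc(mel)

estimator_loss = -club.log_likelihood(emo.detach(), tim.detach())  # fit q(timbre | emotion)
disentangle_loss = club.mi_upper_bound(emo, tim)                   # added to the TTS training loss
```

In practice the estimator and the encoders would be updated alternately: the estimator maximizes the log-likelihood on detached embeddings, while the encoders (and the rest of the TTS model) minimize the MI upper bound together with the synthesis losses.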

Zero-Shot TTS Samples

Each entry below provides audio for the reference speech and the corresponding outputs of the proposed method and four baselines: FS2+GSTs, StyleSpeech, FS2+MIST, and DC Comix TTS.

neutral (female, male)
angry (female, male)
happy (female, male)
sad (female, male)
surprise (female, male)