cythc-github.io

MRMI-TTS: Multi-reference Audios and Mutual Information-Based Zero-shot Speech Synthesis

Abstract

Zero-shot text-to-speech aims to clone a target speaker's voice from only a small amount of that speaker's data, even when the speaker is unseen in the training set. Existing multi-speaker systems can generate high-fidelity speech, but cloning unseen speakers' voices remains a challenging task. To generalize to new speakers, previous works use a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio. However, a single reference audio does not contain sufficient timbre information about the target speaker, and these methods ignore the correlation between different speech representations during training, which causes content information to leak into the speaker representation and thus degrades text-to-speech performance. In this paper, we propose to mitigate these two problems by using multiple reference audios and by using a content encoder and a speaker encoder to obtain content and speaker embeddings of the reference audios. To obtain more disentangled representations, the proposed method further minimizes the mutual information between the two embeddings to remove entangled information within each embedding. Experiments on the VCTK dataset indicate that our method improves both the similarity and the naturalness of synthesized speech, even for unseen speakers.
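The two ideas in the abstract can be illustrated with a small sketch. The abstract does not say how the embeddings of several reference audios are combined or which mutual information estimator is used, so the code below makes two assumptions: the per-reference speaker embeddings are mean-pooled, and the mutual information is bounded with a CLUB-style sample estimate under a Gaussian approximation q(speaker | content). The function names `pool_speaker_embedding` and `club_upper_bound` are hypothetical and are not taken from the paper.

```python
import math


def pool_speaker_embedding(ref_embeddings):
    """Mean-pool per-reference speaker embeddings (list of equal-length vectors).

    Assumption: simple averaging; the paper may combine references differently.
    """
    n = len(ref_embeddings)
    dim = len(ref_embeddings[0])
    return [sum(e[d] for e in ref_embeddings) / n for d in range(dim)]


def club_upper_bound(mu, logvar, y):
    """CLUB-style sample estimate of an upper bound on I(x; y).

    mu[i], logvar[i]: mean and log-variance of a Gaussian q(y | x_i),
    e.g. predicted from content embedding x_i (assumed, not from the paper).
    y[i]: the embedding paired with x_i, e.g. the speaker embedding.
    """
    n = len(y)
    dim = len(y[0])
    total = 0.0
    for i in range(n):
        for d in range(dim):
            inv_two_var = 1.0 / (2.0 * math.exp(logvar[i][d]))
            # Log-likelihood term for the matched pair (x_i, y_i).
            positive = -((mu[i][d] - y[i][d]) ** 2) * inv_two_var
            # Average log-likelihood term over all y_j paired with x_i.
            negative = -sum((mu[i][d] - y[j][d]) ** 2 for j in range(n)) / n * inv_two_var
            total += positive - negative
    return total / n


# Three reference audios, each with a 2-dimensional toy speaker embedding.
pooled = pool_speaker_embedding([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Toy batch: if q predicts y perfectly from x, the estimate is positive;
# if q ignores x (constant mean), the estimate is zero.
y = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]]
zeros = [[0.0, 0.0] for _ in y]
dependent = club_upper_bound(y, zeros, y)
independent = club_upper_bound([[0.5, 0.5] for _ in y], zeros, y)
```

In a training loop, an estimate of this kind would typically be added to the synthesis loss as a penalty, with the approximation q fitted in a separate step; whether MRMI-TTS does exactly this is not stated on this page.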

Synthesized samples -- Seen Speakers
Synthesized samples -- Unseen Speakers
Synthesized samples -- Different number of reference audios

1. Synthesized samples -- Seen Speakers

Using three reference audios

Models	P259: Behind him was his brother.	P261: You have to see the work.
reference audios
Ground Truth
FS2+speaker ID
StyleSpeech
Meta-StyleSpeech
MRMI-TTS w/o discriminator
MRMI-TTS w/o MI
MRMI-TTS

2. Synthesized samples -- Unseen Speakers

Using three reference audios from VCTK

Models	P238: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.	P237: We're in the premier division and we intend to stay there.
reference audios
Ground Truth
StyleSpeech
Meta-StyleSpeech
MRMI-TTS w/o discriminator
MRMI-TTS w/o MI
MRMI-TTS

Using three reference audios from LibriTTS

Models	P3570: Wednesday night was a difficult time for Britton.	P4077: Wednesday night was a difficult time for Britton.
reference audios
StyleSpeech
Meta-StyleSpeech
MRMI-TTS w/o discriminator
MRMI-TTS w/o MI
MRMI-TTS

Using three reference audios from AISHELL3

Models	SSB0005: Wednesday night was a difficult time for Britton.	SSB0535: Wednesday night was a difficult time for Britton.
reference audios
StyleSpeech
Meta-StyleSpeech
MRMI-TTS w/o discriminator
MRMI-TTS w/o MI
MRMI-TTS