MRMI-TTS: Multi-reference Audios and Mutual Information-Based Zero-shot Speech Synthesis
Abstract
Zero-shot text-to-speech aims to clone a target speaker’s voice from only a few samples of that speaker, even when the speaker is unseen in the training set. Existing multi-speaker systems are capable of performing high-fidelity speech generation, but cloning unseen speakers’ voices is still a challenging task. To generalize to new speakers, previous works use a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio. However, a single reference audio does not contain sufficient timbre information about the target speaker, and these methods ignore the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades text-to-speech performance. In this paper, we propose to mitigate these two problems by using multiple reference audios and by using a content encoder and a speaker encoder to obtain content and speaker embeddings of the reference audios. To obtain more disentangled representations, the proposed method further minimizes the mutual information between the two embeddings to remove entangled information within each embedding. Experiments on the VCTK dataset indicate that our method improves both the similarity and the naturalness of synthesized speech, even for unseen speakers.
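The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: this page does not say how the embeddings of several reference audios are combined or which mutual information estimator is used. The sketch assumes mean pooling for the former and a CLUB-style sampled upper bound with a fixed unit-variance Gaussian q(speaker | content) for the latter. All function names and the `weight` matrix are hypothetical.

```python
# Illustrative sketch only. Mean pooling and the CLUB-style bound are
# assumptions, not details confirmed by this page.

def mean_pool(embeddings):
    """Average several fixed-size embeddings, one per reference audio."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def log_q(speaker, content, weight):
    """Log-density, up to a constant, of a unit-variance Gaussian
    q(speaker | content) whose mean is a linear map of the content embedding."""
    mean = [sum(w * c for w, c in zip(row, content)) for row in weight]
    return -0.5 * sum((s - m) ** 2 for s, m in zip(speaker, mean))


def club_upper_bound(speakers, contents, weight):
    """Sampled CLUB-style estimate: mean log q over matched pairs minus
    mean log q over all (content i, speaker j) pairs."""
    n = len(speakers)
    positive = sum(log_q(speakers[i], contents[i], weight) for i in range(n)) / n
    negative = sum(
        log_q(speakers[j], contents[i], weight)
        for i in range(n)
        for j in range(n)
    ) / (n * n)
    return positive - negative


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical variational weights
    pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # three references
    print("pooled speaker embedding:", pooled)
    entangled = club_upper_bound([[0.0, 1.0], [1.0, 0.0]],
                                 [[0.0, 1.0], [1.0, 0.0]], identity)
    print("bound when speaker embedding copies content:", entangled)
```

In a training setup of this kind, the bound would be added to the synthesis loss so that the encoders are pushed toward embeddings for which it is small, while the variational network behind `log_q` is trained separately to fit q(speaker | content).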
Contents
- Synthesized samples -- Seen Speakers
- Synthesized samples -- Unseen Speakers
- Synthesized samples -- Different number of reference audios
1. Synthesized samples -- Seen Speakers
Using three reference audios
| Models | P259: Behind him was his brother. | P261: You have to see the work. |
|---|---|---|
| reference audios | ||
| Ground Truth | ||
| FS2+speaker ID | ||
| StyleSpeech | ||
| Meta-StyleSpeech | ||
| MRMI-TTS w/o discriminator | ||
| MRMI-TTS w/o MI | ||
| MRMI-TTS | ||
2. Synthesized samples -- Unseen Speakers
Using three reference audios from VCTK
| Models | P238: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station. | P237: We're in the premier division and we intend to stay there. |
|---|---|---|
| reference audios | ||
| Ground Truth | ||
| StyleSpeech | ||
| Meta-StyleSpeech | ||
| MRMI-TTS w/o discriminator | ||
| MRMI-TTS w/o MI | ||
| MRMI-TTS | ||
Using three reference audios from LibriTTS
| Models | P3570: Wednesday night was a difficult time for Britton. | P4077: Wednesday night was a difficult time for Britton. |
|---|---|---|
| reference audios | ||
| StyleSpeech | ||
| Meta-StyleSpeech | ||
| MRMI-TTS w/o discriminator | ||
| MRMI-TTS w/o MI | ||
| MRMI-TTS | ||
Using three reference audios from AISHELL3
| Models | SSB0005: Wednesday night was a difficult time for Britton. | SSB0535: Wednesday night was a difficult time for Britton. |
|---|---|---|
| reference audios | ||
| StyleSpeech | ||
| Meta-StyleSpeech | ||
| MRMI-TTS w/o discriminator | ||
| MRMI-TTS w/o MI | ||
| MRMI-TTS | ||
3. Synthesized samples -- Different number of reference audios
| Models | 1 ref | 3 ref | 5 ref |
|---|---|---|---|
| MRMI-TTS | |||
| ours | |||