Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers
Anonymous Authors
Abstract.
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model, which generates the attributes in each subspace following its corresponding prompt. With this factorization design, our method can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experimental results show that our method outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
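To make the idea of one quantizer per attribute concrete, here is a toy sketch of vector quantization with a separate codebook for each attribute subspace. It is not the codec described above: the attribute names come from the abstract, but the function names, the two-dimensional latents, and the tiny codebooks are all hypothetical, and how each attribute is actually encoded is not specified on this page.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def factorized_quantize(
    latents: Dict[str, Vector], codebooks: Dict[str, List[Vector]]
) -> Dict[str, int]:
    """Quantize each attribute latent with its own codebook (hypothetical helper)."""
    return {name: nearest_code(vec, codebooks[name]) for name, vec in latents.items()}


# Hypothetical toy codebooks: one per attribute subspace.
codebooks = {
    "content": [[0.0, 0.0], [1.0, 1.0]],
    "prosody": [[0.0, 1.0], [1.0, 0.0]],
}
# Hypothetical per-attribute latents for a single frame.
latents = {"content": [0.9, 0.8], "prosody": [0.1, 0.9]}
codes = factorized_quantize(latents, codebooks)
```

Because each attribute has its own codebook, the token for one attribute can be replaced without touching the others, which is the property the attribute-specific prompts rely on.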
This page is for research demonstration purposes only.
Overview
The overview of our method, with a neural speech codec for attribute factorization and a factorized diffusion model.
LibriSpeech Samples
Evaluation results on LibriSpeech test-clean. (P) denotes results reported in the original paper. (R) denotes reproduced results.
| System | Sim-O↑ | Sim-R↑ | WER↓ | CMOS↑ | SMOS↑ |
|---|---|---|---|---|---|
| Ground Truth | 0.68 | - | 1.94 | +0.08 | 3.85 |
| NaturalSpeech 2 | 0.55 | 0.62 | 1.94 | -0.18 | 3.65 |
| Voicebox | 0.64 | 0.67 | 2.03 | -0.23 | 3.69 |
| Voicebox (R) | 0.48 | 0.50 | 2.14 | -0.32 | 3.52 |
| VALL-E (P) | - | 0.58 | 5.90 | - | - |
| VALL-E (R) | 0.47 | 0.51 | 6.11 | -0.60 | 3.46 |
| Mega-TTS 2 | 0.53 | - | 2.32 | -0.20 | 3.63 |
| UniAudio | 0.57 | 0.68 | 2.49 | -0.25 | 3.71 |
| StyleTTS 2 | 0.38 | - | 2.49 | -0.21 | 3.07 |
| HierSpeech++ | 0.51 | - | 6.33 | -0.41 | 3.50 |
| Our method | 0.67 | 0.76 | 1.81 | 0.00 | 4.01 |
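For readers unfamiliar with the WER column, the following is a minimal sketch of word error rate as it is usually defined: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The page does not state which speech recognizer or text normalization was used, so the lowercasing and whitespace tokenization here are assumptions, and the function name is illustrative. The table values appear to be percentages, while this function returns a fraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One inserted word against an 11-word reference gives 1/11.
reference = "it is this that is of interest to theory of knowledge"
hypothesis = "it is this that is of interest to the theory of knowledge"
wer = word_error_rate(reference, hypothesis)
```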
(R) denotes reproduced results.
| Text | Prompt | Ground Truth | Our Method | NaturalSpeech 2 | Voicebox | Voicebox (R) | VALL-E (R) | Mega-TTS 2 | UniAudio | StyleTTS 2 | HierSpeech++ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| It is this that is of interest to theory of knowledge. | - | - | Sim-O: 0.73 | Sim-O: 0.65 | Sim-O: 0.71 | Sim-O: 0.49 | Sim-O: 0.55 | Sim-O: 0.55 | Sim-O: 0.62 | Sim-O: 0.39 | Sim-O: 0.59 |
| For, like as not, they must have thought him a prince when they saw his fine cap. | - | - | Sim-O: 0.73 | Sim-O: 0.43 | Sim-O: 0.62 | Sim-O: 0.54 | Sim-O: 0.45 | Sim-O: 0.42 | Sim-O: 0.47 | Sim-O: 0.40 | Sim-O: 0.46 |
| What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment. | - | - | Sim-O: 0.69 | Sim-O: 0.66 | Sim-O: 0.59 | Sim-O: 0.59 | Sim-O: 0.61 | Sim-O: 0.46 | Sim-O: 0.53 | Sim-O: 0.52 | Sim-O: 0.49 |
| The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting. | - | - | Sim-O: 0.81 | Sim-O: 0.64 | Sim-O: 0.67 | Sim-O: 0.65 | Sim-O: 0.65 | Sim-O: 0.60 | Sim-O: 0.45 | Sim-O: 0.44 | Sim-O: 0.55 |
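The page does not define Sim-O. In zero-shot TTS evaluation, speaker similarity is commonly computed as the cosine similarity between a speaker embedding of the generated speech and a speaker embedding of the original prompt, and the sketch below assumes that definition. The embedding extractor (typically a speaker-verification model) is not shown, and the short vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings standing in for speaker-verification model outputs.
prompt_embedding = [0.2, 0.5, -0.1, 0.7]
generated_embedding = [0.25, 0.45, -0.05, 0.65]
similarity = cosine_similarity(prompt_embedding, generated_embedding)
```

Under this definition the score lies in [-1, 1], and higher values mean the generated voice is closer to the prompt speaker.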
Emotional Samples
Evaluation results on the emotional speech dataset Ravdess. (R) denotes reproduced results.
| System | Average MCD↓ | MCD-Acc↑ | CMOS↑ | SMOS↑ |
|---|---|---|---|---|
| Ground Truth | 0.00 | 1.00 | +0.17 | 4.42 |
| NaturalSpeech 2 | 4.56 | 0.25 | -0.22 | 4.04 |
| Voicebox (R) | 4.88 | 0.34 | -0.34 | 3.92 |
| VALL-E (R) | 5.03 | 0.34 | -0.55 | 3.80 |
| Mega-TTS 2 | 4.44 | 0.39 | -0.20 | 4.51 |
| StyleTTS 2 | 4.50 | 0.40 | -0.25 | 3.98 |
| HierSpeech++ | 6.08 | 0.30 | -0.37 | 3.87 |
| Our method | 4.28 | 0.52 | 0.00 | 4.72 |
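The page does not say how MCD is computed. A common definition of mel-cepstral distortion is, per frame, (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over time-aligned frames, and the sketch below assumes that form. Feature extraction and alignment (for example dynamic time warping) are not shown, whether the 0th coefficient is excluded is left to the caller, and the function name is illustrative.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    reference: Sequence[Sequence[float]], synthesized: Sequence[Sequence[float]]
) -> float:
    """Average MCD in dB over frames that are already time-aligned."""
    if len(reference) != len(synthesized) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two cepstral vectors of this frame.
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame)))
    return scale * total / len(reference)


# Hypothetical two-frame, three-coefficient example.
ref_frames = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.1]]
syn_frames = [[1.1, 0.4, -0.2], [0.7, 0.5, 0.0]]
mcd = mel_cepstral_distortion(ref_frames, syn_frames)
```

Comparing a signal with itself yields 0 under this definition, which is consistent with the 0.00 entry in the Ground Truth row.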
Emotional samples of our method, using the LibriSpeech text "Why fades the lotus of the water".
Columns: Prompt Emotion, Prompt, Our Method.
Prompt emotions (one row each): neutral, happy, calm, sad, angry, fearful, disgust, surprised.
Emotional samples on Ravdess. Ravdess has only two texts: "Dogs are sitting by the door." is used as the prompt text, and "Kids are talking by the door." as the synthesis text. (R) denotes reproduced results.
Columns: Prompt Emotion, Prompt, Ground Truth, Our Method, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, HierSpeech++.
Prompt emotions (one row each): neutral, happy, calm, sad, angry, fearful, disgust, surprised.
Attribute Manipulation
Row 1: Original prompt. Row 2: Slowed version. Row 3: Sped-up version. Row 4: New prompt with a fast speech rate.
Columns: Manipulation Type, Duration Prompt, Other Prompts, Our Method.
Manipulation type by row: - (Row 1), duration (Rows 2 to 4).
Row 1: Original prompt. Row 2: Slowed version with a smoother pitch contour. Row 3: New prompt with a sharp pitch contour. Note: the speech rate of the prosody prompt does not affect the speech rate of the generated audio in Row 2.
Columns: Manipulation Type, Prosody Prompt, Other Prompts, Our Method.
Manipulation type by row: - (Row 1), prosody (Rows 2 and 3).
Row 1: Original prompt. Row 2: New prompt from a different speaker.
Columns: Manipulation Type, Timbre Prompt, Other Prompts, Our Method.
Manipulation type by row: - (Row 1), timbre (Row 2).
Row 1: Original prompt. Row 2: New duration prompt with a slow speech rate, combined with a new prosody prompt with a smooth pitch contour.
Columns: Manipulation Type, Duration Prompt, Prosody Prompt, Other Prompts, Our Method.
Manipulation type by row: - (Row 1), Mix (duration & prosody) (Row 2).
Reconstruction Samples
(R) denotes reproduced results.
Ground Truth
Our Codec 4.8 kbps
SoundStream (R) 4.8 kbps
SoundStream (R) 9.6 kbps
Encodec 6.0 kbps
Encodec (R) 5.0 kbps
HiFi-Codec 2.0 kbps
DAC 4.5 kbps
Voice Conversion Samples
Prompt
Source
Our Codec
Ethics Statement
Since our model can synthesize speech with high speaker similarity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. To prevent misuse, it is crucial to develop a robust synthesized speech detection model and establish a system for individuals to report any suspected misuse.