Factorized Diffusion Models

Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers

Anonymous Authors

Abstract. While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model, which generates the attributes in each subspace following its corresponding prompt. With this factorization design, our method can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experimental results show that our method outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
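To make the idea of per-attribute quantization concrete, the following is a minimal toy sketch, not the codec described above. It assumes each attribute already has its own latent vector and its own small codebook, and it picks the nearest codeword independently in each subspace. The function names, the 2-D codebooks, and the latent values are all hypothetical, and whether every attribute is quantized this way in the actual codec is not stated on this page.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def factorized_quantize(latents, codebooks):
    """Quantize each attribute latent with its own codebook.

    `latents` and `codebooks` are dicts keyed by attribute name. Each
    subspace is handled independently of the others, which is the
    divide-and-conquer idea in miniature.
    """
    return {name: nearest_code(latents[name], codebooks[name]) for name in latents}

# Hypothetical 2-D codebooks, one per attribute subspace (illustrative values only).
codebooks = {
    "content": [(0.0, 0.0), (1.0, 1.0)],
    "prosody": [(0.0, 1.0), (1.0, 0.0)],
    "timbre": [(-1.0, -1.0), (1.0, 1.0)],
    "acoustic_details": [(0.5, 0.5), (2.0, 2.0)],
}

# Hypothetical per-attribute latents for a single frame.
latents = {
    "content": (0.9, 0.8),
    "prosody": (0.1, 0.7),
    "timbre": (-0.6, -0.9),
    "acoustic_details": (1.8, 2.1),
}

codes = factorized_quantize(latents, codebooks)
print(codes)
```

Because each subspace has its own code index, one attribute's code can be swapped for another's without touching the rest, which is the property that the attribute manipulation samples on this page rely on.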

This page is for research demonstration purposes only.

Overview

The overview of our method, with a neural speech codec for attribute factorization and a factorized diffusion model.

LibriSpeech Samples

Evaluation results on LibriSpeech test-clean. (P) denotes results taken from the original paper. (R) denotes reproduced results.

-	Sim-O↑	Sim-R↑	WER↓	CMOS↑	SMOS↑
Ground Truth	0.68	-	1.94	+0.08	3.85
NaturalSpeech 2	0.55	0.62	1.94	-0.18	3.65
Voicebox	0.64	0.67	2.03	-0.23	3.69
Voicebox (R)	0.48	0.50	2.14	-0.32	3.52
VALL-E (P)	-	0.58	5.90	-	-
VALL-E (R)	0.47	0.51	6.11	-0.60	3.46
Mega-TTS 2	0.53	-	2.32	-0.20	3.63
UniAudio	0.57	0.68	2.49	-0.25	3.71
StyleTTS 2	0.38	-	2.49	-0.21	3.07
HierSpeech++	0.51	-	6.33	-0.41	3.50
Our method	0.67	0.76	1.81	0.00	4.01
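For readers unfamiliar with the WER column, here is a minimal sketch of the standard word error rate: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name and the example hypothesis string are hypothetical. The ASR model and text normalization used for the table above are not specified on this page, so this sketch will not reproduce those numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of
    # `ref` and the first j words of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical recognizer output with one substitution and one deletion.
reference = "Why fades the lotus of the water"
hypothesis = "why fade the lotus of water"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.4f}")
```

The function returns a fraction; multiply by 100 for a percentage, which is how such tables are commonly reported (an assumption here, since the page does not state the unit).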

(R) denotes reproduced results.

Text	Our Method	NaturalSpeech 2	Voicebox	Voicebox (R)	VALL-E (R)	Mega-TTS 2	UniAudio	StyleTTS 2	HierSpeech++
It is this that is of interest to theory of knowledge.	Sim-O: 0.73	Sim-O: 0.65	Sim-O: 0.71	Sim-O: 0.49	Sim-O: 0.55	Sim-O: 0.55	Sim-O: 0.62	Sim-O: 0.39	Sim-O: 0.59
For, like as not, they must have thought him a prince when they saw his fine cap.	Sim-O: 0.73	Sim-O: 0.43	Sim-O: 0.62	Sim-O: 0.54	Sim-O: 0.45	Sim-O: 0.42	Sim-O: 0.47	Sim-O: 0.40	Sim-O: 0.46
What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.	Sim-O: 0.69	Sim-O: 0.66	Sim-O: 0.59	Sim-O: 0.59	Sim-O: 0.61	Sim-O: 0.46	Sim-O: 0.53	Sim-O: 0.52	Sim-O: 0.49
The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.	Sim-O: 0.81	Sim-O: 0.64	Sim-O: 0.67	Sim-O: 0.65	Sim-O: 0.65	Sim-O: 0.60	Sim-O: 0.45	Sim-O: 0.44	Sim-O: 0.55

Emotional Samples

Evaluation results on the emotional speech dataset Ravdess. (R) denotes reproduced results.

-	average MCD↓	MCD-Acc↑	CMOS↑	SMOS↑
Ground Truth	0.00	1.00	+0.17	4.42
NaturalSpeech 2	4.56	0.25	-0.22	4.04
Voicebox (R)	4.88	0.34	-0.34	3.92
VALL-E (R)	5.03	0.34	-0.55	3.80
Mega-TTS 2	4.44	0.39	-0.20	4.51
StyleTTS 2	4.50	0.40	-0.25	3.98
HierSpeech++	6.08	0.30	-0.37	3.87
Our method	4.28	0.52	0.00	4.72

Emotional samples of our method, using the text "Why fades the lotus of the water" from LibriSpeech.

Prompt Emotion	Prompt	Our Method
neutral
happy
calm
sad
angry
fearful
disgust
surprised

Emotional samples on Ravdess. Ravdess has only two texts: "Dogs are sitting by the door." is used as the prompt text, and "Kids are talking by the door." is used as the synthesis text. (R) denotes reproduced results.

Prompt Emotion	Prompt	Ground Truth	Our Method	NaturalSpeech 2	Voicebox (R)	VALL-E (R)	Mega-TTS 2	StyleTTS 2	HierSpeech++
neutral
happy
calm
sad
angry
fearful
disgust
surprised

Attribute Manipulation

Row 1: Original prompt. Row 2: Slowed version. Row 3: Sped-up version. Row 4: New prompt with a fast speech rate.

Manipulation Type	Duration Prompt	Other Prompts	Our Method
-
duration
duration
duration

Row 1: Original prompt. Row 2: Slowed version with a smoother pitch contour. Row 3: New prompt with a sharp pitch contour. Note: the speech rate of the prosody prompt does not affect the speech rate of the generated audio in Row 2.

Manipulation Type	Prosody Prompt	Other Prompts	Our Method
-
prosody
prosody

Row 1: Original prompt. Row 2: New prompt from a different speaker.

Manipulation Type	Timbre Prompt	Other Prompts	Our Method
-
timbre

Row 1: Original prompt. Row 2: New duration prompt with a slow speech rate, combined with a new prosody prompt with a smooth pitch contour.

Manipulation Type	Duration Prompt	Prosody Prompt	Other Prompts	Our Method
-
Mix (duration & prosody)

Reconstruction Samples

(R) denotes reproduced results.

Ground Truth	Our Codec 4.8 kbps	SoundStream (R) 4.8 kbps	SoundStream (R) 9.6 kbps	Encodec 6.0 kbps	Encodec (R) 5.0 kbps	HiFi-Codec 2.0 kbps	DAC 4.5 kbps

Voice Conversion Samples

Prompt	Source	Our Codec

Ethics Statement

Because our model can synthesize speech with high speaker similarity, it carries a risk of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. To prevent misuse, it is crucial to develop a robust synthesized speech detection model and to establish a system for individuals to report any suspected misuse.