Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

8 Jun 2024 | Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li
This paper introduces the Autoregressive Diffusion Transformer (ARDiT) for text-to-speech (TTS) synthesis. Instead of quantizing audio into discrete tokens, ARDiT encodes audio as sequences of continuous tokens and generates them autoregressively with a decoder-only diffusion transformer, enabling high-quality speech synthesis and editing. The model is further distilled using an Integral Kullback-Leibler (IKL) divergence objective, which improves sample quality and reduces inference latency; a distilled ARDiT can generate 170 ms of 24 kHz speech per evaluation step with minimal performance degradation.

ARDiT is also effective for speech editing: given a text prompt, it can fill in missing segments of an utterance. In zero-shot TTS and speech editing tasks, ARDiT outperforms existing models, achieving near-perfect speech editing on the LibriTTS dataset. Performance is evaluated with subjective and objective metrics, including MUSHRA scores and Word Error Rate (WER), and the results show that ARDiT significantly outperforms the baseline models on both.

The paper also discusses the impact of block size on training efficiency and the use of position embeddings to control the total duration of the generated speech. The model is trained on the LibriTTS dataset and evaluated on test sets from UniCATS. The study concludes that ARDiT is a promising approach to audio generation, with potential applications across a range of audio tasks.
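To make the generation scheme concrete, the sketch below illustrates block-wise autoregressive sampling of continuous tokens: each block starts from Gaussian noise and is iteratively refined, conditioned on the text and on all previously generated blocks. This is a minimal illustration under stated assumptions, not the authors' implementation: the model interface (`model(noisy_block=..., t=..., text=..., prefix=...)`), the velocity-prediction Euler sampler, and the hyperparameters (`block_size`, `token_dim`, `num_steps`) are all hypothetical, and the actual ARDiT sampler differs in detail.

```python
# Minimal sketch of block-wise autoregressive generation of continuous audio
# tokens, in the spirit of ARDiT. The model interface and the flow-matching
# style Euler sampler below are assumptions for illustration, not the paper's
# actual implementation.
import torch

@torch.no_grad()
def generate_blocks(model, text_emb, num_blocks, block_size, token_dim,
                    num_steps=8, device="cpu"):
    """Emit `num_blocks` blocks of continuous tokens, one block at a time.

    Each new block starts from Gaussian noise and is iteratively refined,
    conditioned on the text and on all previously generated blocks.
    """
    generated = []  # list of (block_size, token_dim) tensors
    for _ in range(num_blocks):
        prefix = (torch.cat(generated, dim=0) if generated
                  else torch.zeros(0, token_dim, device=device))
        x = torch.randn(block_size, token_dim, device=device)  # start from noise
        dt = 1.0 / num_steps
        for step in range(num_steps):
            t = 1.0 - step * dt  # noise level: 1.0 = pure noise, 0.0 = clean
            t_vec = torch.full((1,), t, device=device)
            # Hypothetical interface: the network predicts a velocity field
            # given the noisy block, noise level, text, and generated prefix.
            v = model(noisy_block=x, t=t_vec, text=text_emb, prefix=prefix)
            x = x - dt * v  # Euler step toward the clean block
        generated.append(x)
    # Concatenate blocks into one continuous-token sequence for the decoder.
    return torch.cat(generated, dim=0)
```

In this framing, a larger block size means fewer autoregressive steps, since each refinement pass produces more tokens; after IKL distillation the number of refinement steps per block can be reduced sharply, which is how the model reaches roughly 170 ms of generated audio per evaluation step.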