Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

8 Jun 2024 | Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li
The paper introduces the Autoregressive Diffusion Transformer (ARDiT) for text-to-speech (TTS) synthesis, addressing the limitations of discrete audio tokenization in audio language models. ARDiT encodes audio as continuous vector sequences in $\mathbb{R}^d$ and generates these sequences with a decoder-only diffusion transformer. This approach mitigates the trade-off between code bitrate and reconstruction accuracy, enabling near-perfect speech editing and high-quality speech generation. The authors apply Distribution Matching Distillation (DMD) to distill ARDiT models, improving their performance while reducing inference latency. ARDiTs are trained with a modified attention mask and sequence layout that improve training efficiency. Evaluated on the LibriTTS dataset, the models demonstrate zero-shot TTS and speech editing performance superior to state-of-the-art baselines.

The paper also analyzes the impact of block size on ARDiT performance and presents a method for controlling the total duration of generated speech using Rotary Position Embeddings (RoPE).
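To make the generation scheme concrete, here is a minimal toy sketch of blockwise autoregressive diffusion decoding: each block of continuous vectors starts from Gaussian noise and is iteratively denoised conditioned on all previously generated vectors. Everything here is illustrative (the function names, the trivial "denoiser", and the dimensions are invented for this sketch); the actual ARDiT denoiser is a decoder-only transformer, not the simple contraction used below.

```python
import numpy as np

def denoise_block(noisy_block, context, steps=8):
    """Toy stand-in for the diffusion denoiser: each step nudges the noisy
    block toward a context-dependent target (here, the context mean)."""
    if len(context):
        target = np.broadcast_to(context.mean(axis=0), noisy_block.shape)
    else:
        target = np.zeros_like(noisy_block)
    x = noisy_block
    for _ in range(steps):
        x = x + 0.5 * (target - x)  # simple contraction toward the target
    return x

def ardit_generate(num_blocks=4, block_len=3, dim=2, seed=0):
    """Blockwise autoregressive generation of continuous vectors:
    block t is denoised from noise, conditioned on blocks 0..t-1."""
    rng = np.random.default_rng(seed)
    context = np.zeros((0, dim))
    out = []
    for _ in range(num_blocks):
        noise = rng.standard_normal((block_len, dim))
        block = denoise_block(noise, context)
        out.append(block)
        context = np.concatenate([context, block], axis=0)
    return np.concatenate(out, axis=0)

speech_vectors = ardit_generate()
print(speech_vectors.shape)  # (12, 2): a sequence of continuous vectors
```

The key contrast with token-based audio language models is visible in the output type: the model emits real-valued vectors in $\mathbb{R}^d$ directly, so no discrete codebook (and hence no bitrate/reconstruction trade-off) is involved. Larger block sizes trade autoregressive steps for more parallel denoising work per step, which is the trade-off the paper studies.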