3 Jun 2024 | Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
**DITTO: Diffusion Inference-Time T-Optimization for Music Generation**
**Abstract:**
We propose DITTO, a framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. DITTO can be used with any differentiable feature matching loss to achieve target outputs and leverages gradient checkpointing for memory efficiency. We demonstrate its wide range of applications, including inpainting, outpainting, looping, and control over intensity, melody, and musical structure, all without fine-tuning the model. Compared to related methods, DITTO achieves state-of-the-art performance on most tasks, outperforming comparable approaches in controllability, audio quality, and computational efficiency.
**Introduction:**
Large-scale diffusion models have emerged as a leading paradigm for generative media, including text-to-image and text-to-audio generation. Text conditioning provides high-level semantic control, but fine-grained, time-varying control remains limited. DITTO addresses this by optimizing initial noise latents during inference, achieving expressive control over music generation without requiring supervised training.
**Related Work:**
We review existing methods for text-conditioned diffusion models, including training-based approaches, inference-time guidance, and optimization-based methods. DITTO stands out by providing fine-grained control without the need for large-scale training or complex sampling algorithms.
**Diffusion Inference-Time T-Optimization:**
We formulate control as an optimization problem over the initial noise latent: the latent is tuned so that features of the generated audio match a target, with gradients backpropagated through the full sampling chain. Gradient checkpointing keeps memory usage manageable during this backpropagation.
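To make the procedure concrete, below is a minimal PyTorch-style sketch of the optimization loop. It assumes a generic pre-trained denoiser and a differentiable sampler; `sampler_step`, `feature_fn`, and the argument names are illustrative placeholders, not the authors' actual API.

```python
import torch
from torch.utils.checkpoint import checkpoint

def ditto_optimize(model, sampler_step, x_T, text_emb, feature_fn,
                   target, timesteps, n_iters=100, lr=1e-2):
    """Sketch of DITTO-style inference-time optimization.

    `model` is a pre-trained diffusion denoiser, `sampler_step` runs one
    step of a differentiable sampler (e.g., DDIM), and `feature_fn` is a
    differentiable feature extractor; all are hypothetical placeholders.
    """
    x_T = x_T.clone().requires_grad_(True)  # optimize only the initial noise latent
    opt = torch.optim.Adam([x_T], lr=lr)

    for _ in range(n_iters):
        opt.zero_grad()
        x = x_T
        for t in timesteps:
            # Gradient checkpointing: recompute this step's activations in
            # the backward pass instead of storing them, so memory stays
            # roughly constant in the number of sampling steps.
            x = checkpoint(sampler_step, model, x, t, text_emb,
                           use_reentrant=False)
        # Any differentiable feature-matching loss fits here.
        loss = torch.nn.functional.mse_loss(feature_fn(x), target)
        loss.backward()
        opt.step()
    return x_T.detach()
```

Note that only the noise latent is updated while the model weights stay frozen, which is what makes the approach training-free.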
**Applications and Control Frameworks:**
DITTO is applied to various tasks, including outpainting, inpainting, looping, intensity control, melody control, and musical structure control. We demonstrate its effectiveness through qualitative and quantitative evaluations.
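As an illustration of the kind of differentiable loss DITTO can accept, here is a hypothetical intensity-control loss that matches a frame-wise loudness curve; the RMS-in-dB feature and the parameter names are assumptions for this sketch, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def intensity_loss(audio, target_db, frame_len=2048, hop=512):
    """Hypothetical intensity-control loss: match per-frame loudness.

    `audio` is a mono waveform tensor of shape (num_samples,) and
    `target_db` is the desired loudness curve, one value per frame.
    """
    frames = audio.unfold(-1, frame_len, hop)         # (n_frames, frame_len)
    rms = frames.pow(2).mean(dim=-1).clamp_min(1e-8).sqrt()
    loudness_db = 20.0 * torch.log10(rms)             # frame-wise RMS in dB
    return F.mse_loss(loudness_db, target_db)
```

Swapping a chroma-based feature in place of loudness would give a melody-control loss in the same way, since the framework only requires that the feature extractor be differentiable.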
**Experimental Design:**
We evaluate DITTO using a dataset of licensed instrumental music and compare it against several baselines. DITTO achieves state-of-the-art performance in outpainting, inpainting, intensity control, and melody control, while being more efficient in terms of time and memory.
**Results:**
DITTO outperforms baselines on objective metrics and in subjective listening tests, showing superior audio quality and controllability. It is also more efficient than other optimization-based approaches, converging at a similar rate while requiring less computation and memory.
**Conclusion:**
DITTO is a unified, training-free framework for controlling pre-trained diffusion models in music generation, achieving state-of-the-art performance and efficiency. Future work will focus on accelerating the optimization procedure for real-time interaction.