DITTO: Diffusion Inference-Time T-Optimization for Music Generation

2024 | Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
DITTO is a training-free framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing the initial noise latents. By backpropagating through the sampling process, it optimizes these latents with respect to arbitrary differentiable feature-matching losses, using gradient checkpointing to keep memory usage low. This enables a wide range of music generation tasks, including inpainting, outpainting, looping, and intensity, melody, and musical-structure control, without any model fine-tuning and without restricting the model architecture or sampling process.

DITTO achieves state-of-the-art performance across these tasks, outperforming existing methods in controllability, audio quality, and computational efficiency: it runs roughly 2x faster and uses about half the memory of comparable inference-time optimization approaches. In both objective and subjective evaluations it shows superior quality and control over the generated music, enabling high-quality, flexible, training-free control of diffusion models for music generation.
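The core idea, optimizing the initial noise latent so that the sampler's output matches a target feature, can be sketched in miniature. The example below is a hypothetical toy, not DITTO's actual implementation: the "sampler" is a simple linear iteration standing in for a diffusion ODE solver, so the gradient with respect to the initial latent is analytic (a real implementation would backpropagate through the full sampler with gradient checkpointing). The function names and constants are illustrative assumptions.

```python
import numpy as np

def sample(x0, steps=10, a=0.1):
    # Toy deterministic "sampler": repeated linear update,
    # a stand-in for running a diffusion ODE solver from the
    # initial latent x0 to a final output. Equals (1-a)**steps * x0.
    x = x0
    for _ in range(steps):
        x = x - a * x
    return x

def optimize_initial_latent(target, steps=10, a=0.1, lr=0.5, iters=200):
    # DITTO-style loop (simplified): gradient-descend on the initial
    # latent x0 so that the sampled output matches a target feature
    # under an MSE feature-matching loss ||sample(x0) - target||^2.
    x0 = np.zeros_like(target)
    c = (1 - a) ** steps  # toy sampler is linear, so the gradient is analytic
    for _ in range(iters):
        out = sample(x0, steps, a)
        grad = 2.0 * c * (out - target)  # d/dx0 of the MSE loss
        x0 = x0 - lr * grad
    return x0

target = np.array([1.0, -2.0, 0.5])
x0 = optimize_initial_latent(target)
print(np.allclose(sample(x0), target, atol=1e-3))  # True
```

In the real framework the loss can be any differentiable feature-matching objective (e.g. matching an intensity curve or melody features), which is what makes the same mechanism cover inpainting, looping, and structure control.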