This paper explores two zero-shot audio editing techniques using pre-trained diffusion models: *Zero-shot Text-based Audio (ZETA)* and *Zero-shot UnSupervised (ZEUS)*. ZETA, adapted from image-domain methods, uses text prompts to guide the editing process, allowing a wide range of manipulations such as changing the genre or specific instruments. ZEUS, a novel approach, discovers semantically meaningful editing directions without supervision, enabling creative modifications like improvisations on the melody while maintaining high perceptual quality and semantic similarity to the original signal.

Both methods are based on DDPM inversion, which extracts latent noise vectors from the input signal and uses them to generate edited signals. ZETA achieves fine-grained edits by changing text prompts, while ZEUS perturbs the output of the denoiser along the top principal components of the posterior covariance. The paper compares these methods to state-of-the-art models and demonstrates superior performance in generating semantically meaningful modifications. Experiments show that ZEUS can create interesting variations in melody while adhering to the original key, rhythm, and style, and that ZETA allows for changes in style, genre, and instrumentation. The methods are evaluated using metrics such as CLAP, LPAPS, and FAD, and a user study confirms their effectiveness.
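The ZEUS-style perturbation described above can be sketched in a few lines. This is a hedged toy illustration, not the authors' implementation: it assumes we can draw a handful of stochastic denoiser outputs for the same noisy input, estimates the top principal component of their empirical covariance via SVD, and nudges the mean prediction along that direction. All function names and the synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_principal_components(samples: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal components of a (n_samples, dim) sample matrix."""
    centered = samples - samples.mean(axis=0, keepdims=True)
    # Rows of vt are unit-norm principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def perturb_denoiser_output(x0_hat: np.ndarray,
                            posterior_samples: np.ndarray,
                            strength: float = 1.0) -> np.ndarray:
    """Shift the denoiser's mean prediction along the leading posterior PC.

    Hypothetical stand-in for the ZEUS editing step: `posterior_samples`
    plays the role of samples from the denoiser's posterior, whose top
    principal component defines an unsupervised editing direction.
    """
    direction = top_principal_components(posterior_samples, k=1)[0]
    return x0_hat + strength * direction

# Toy example: 16 synthetic "posterior samples" of an 8-dim signal.
samples = rng.normal(size=(16, 8))
x0_hat = samples.mean(axis=0)          # mean denoiser prediction
edited = perturb_denoiser_output(x0_hat, samples, strength=2.0)
print(edited.shape)  # (8,)
```

In the actual method the samples live in the latent space of an audio diffusion model and the perturbation is applied across denoising timesteps; the sketch only conveys the linear-algebra core of the editing direction.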