Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners


27 Feb 2024 | Yazhou Xing*, Yingqing He*, Zeyue Tian*, Xintao Wang, Qifeng Chen
The paper "Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners" addresses the challenge of generating both video and audio content simultaneously, a task that existing methods often handle separately. The authors propose a novel optimization-based framework that leverages pre-trained models to bridge the gap between visual and audio generation. Specifically, they introduce a multimodality latent aligner using the ImageBind model, which shares a common latent representation space for different modalities. This aligner guides the diffusion denoising process during inference, ensuring that the generated content aligns with the input conditions in the ImageBind embedding space. The method is designed to handle four tasks: joint video-audio generation (Joint-VA), video-to-audio (V2A), audio-to-video (A2V), and image-to-audio (I2A). Extensive experiments on various datasets demonstrate the superior performance of the proposed method in terms of audio generation fidelity and audio-visual alignment, outperforming baseline approaches that require large-scale training datasets. The key contributions include a novel paradigm for bridging single-modality diffusion models, the introduction of a diffusion latent aligner, and the demonstration of the method's versatility and generality across multiple tasks.The paper "Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners" addresses the challenge of generating both video and audio content simultaneously, a task that existing methods often handle separately. The authors propose a novel optimization-based framework that leverages pre-trained models to bridge the gap between visual and audio generation. Specifically, they introduce a multimodality latent aligner using the ImageBind model, which shares a common latent representation space for different modalities. This aligner guides the diffusion denoising process during inference, ensuring that the generated content aligns with the input conditions in the ImageBind embedding space. The method is designed to handle four tasks: joint video-audio generation (Joint-VA), video-to-audio (V2A), audio-to-video (A2V), and image-to-audio (I2A). Extensive experiments on various datasets demonstrate the superior performance of the proposed method in terms of audio generation fidelity and audio-visual alignment, outperforming baseline approaches that require large-scale training datasets. The key contributions include a novel paradigm for bridging single-modality diffusion models, the introduction of a diffusion latent aligner, and the demonstration of the method's versatility and generality across multiple tasks.