17 Jul 2024 | Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Morgado
The paper introduces Audio-Synchronized Visual Animation (ASVA), a task that aims to animate a static image so that the generated motion is temporally synchronized with a guiding audio clip. To address the challenge of controlling object dynamics in video generation, the authors propose AVSync15, a high-quality dataset curated from VGGSound, and AVSyncD, a diffusion model that generates audio-guided animations. AVSync15 covers 15 categories with strong audio-visual synchronization cues, providing high-quality and diverse training data. AVSyncD extends a pre-trained image latent diffusion model with trainable temporal layers and audio conditioning mechanisms, allowing audio to guide the generated motion precisely. Extensive evaluations validate the effectiveness of AVSync15 and AVSyncD, demonstrating superior performance in various audio-synchronized generation tasks. The work opens new avenues for controllable visual generation, particularly generating full videos without a base image and controlling object motions with different sounds.