Understanding Audio-Synchronized Visual Animation

This paper introduces Audio-Synchronized Visual Animation (ASVA), a task that aims to animate a static image of an object into a video with clear motion dynamics that are semantically aligned and temporally synchronized with the input audio. To achieve this, the authors propose AVSync15, a high-quality dataset curated from VGGSound, containing 15 dynamic sound classes with 100 examples each. They also introduce AVSyncD, a diffusion model capable of generating audio-guided animations. The dataset and model are evaluated extensively, demonstrating AVSyncD's superior performance in generating synchronized animations. The authors also explore AVSyncD's potential in various audio-synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. The paper highlights the importance of audio-visual synchronization in video generation and presents a new benchmark for this task. The proposed model and dataset are expected to open new avenues for controllable visual generation.

Audio-Synchronized Visual Animation

17 Jul 2024 | Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Morgado