Diff-BGM: A Diffusion Model for Video Background Music Generation

20 May 2024 | Sizhe Li¹, Yiming Qin¹, Minghang Zheng¹, Xin Jin²,³, Yang Liu¹*
Diff-BGM is a diffusion model for video background music generation. The paper introduces BGM909, a high-quality video-music dataset with detailed annotations and shot detection that provides multimodal information for both video and music, and proposes evaluation metrics that assess music quality, diversity, and alignment between video and music.

Given a video, the Diff-BGM framework generates background music automatically: dynamic video features control the music's rhythm, while semantic features control its melody and atmosphere. A feature selector chooses which features to condition on at each stage of the diffusion-based generation process, and a segment-aware cross-attention layer aligns video and music temporally, segment by segment.

Experiments show that Diff-BGM generates high-quality background music with good video-music alignment and outperforms existing methods. Ablation studies confirm that performance improves as video features and the cross-attention layer are added. The paper concludes that Diff-BGM is effective for video background music generation and contributes new evaluation metrics for video-music correspondence and diversity.
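To make the segment-aware cross-attention idea concrete, below is a minimal PyTorch sketch. It is an illustrative assumption rather than the paper's released code: music latent tokens attend only to video features belonging to the same detected shot, implemented with a boolean attention mask over a standard multi-head cross-attention. The class name, tensor shapes, and segment-id inputs are all hypothetical.

```python
import torch
import torch.nn as nn


class SegmentAwareCrossAttention(nn.Module):
    """Sketch of cross-attention restricted to matching video shots (assumed design)."""

    def __init__(self, music_dim: int, video_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=music_dim, kdim=video_dim, vdim=video_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, music_tokens, video_feats, music_seg_ids, video_seg_ids):
        """
        music_tokens : (B, Lm, music_dim)  latent music sequence
        video_feats  : (B, Lv, video_dim)  per-frame / per-clip video features
        music_seg_ids: (B, Lm) shot index assigned to each music token
        video_seg_ids: (B, Lv) shot index of each video feature
        """
        # Mask attention across different shots: entry (i, j) is True (blocked)
        # when music token i and video feature j belong to different segments.
        mask = music_seg_ids.unsqueeze(-1) != video_seg_ids.unsqueeze(1)  # (B, Lm, Lv)
        # nn.MultiheadAttention expects one mask per attention head.
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)

        out, _ = self.attn(music_tokens, video_feats, video_feats, attn_mask=mask)
        return music_tokens + out  # residual connection back into the music stream
```

In this sketch each music token is assigned to a shot by its time position, so the mask enforces the sequential video-music alignment the paper describes; in practice one would also guard against music tokens whose segment has no video features, since a fully masked row would otherwise produce undefined attention weights.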