20 May 2024 | Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, Yang Liu
This paper introduces Diff-BGM, a diffusion model for video background music generation. The authors address two main challenges in this task: the lack of suitable training datasets, and the difficulty of controlling the generation process and aligning the music with the video. They propose BGM909, a high-quality music-video dataset with detailed annotations and shot detection that provides multi-modal information about both the video and the music, and they present evaluation metrics that assess music quality, diversity, and music-video alignment. The Diff-BGM framework automatically generates background music for a given video, using dynamic video features to control rhythm and semantic features to control melody and atmosphere; a segment-aware cross-attention layer is introduced to align the video and music sequentially. The paper also reviews related work in music generation and background music generation, highlighting the limitations of existing methods and the advantages of the proposed approach. Objective and subjective experiments verify the effectiveness of the method, showing that Diff-BGM outperforms existing models in generating high-quality background music that aligns well with the video. The code and models are available on GitHub.
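To make the segment-aware cross-attention idea concrete, the sketch below shows one minimal way such a layer could look. This is not the authors' implementation: the module name, tensor shapes, and the use of a boolean mask derived from per-segment (shot) indices are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the paper's code). Assumes PyTorch and that
# every music token and video frame has been assigned a segment index
# (e.g. from shot detection), so music tokens only attend to video frames
# within their own segment.
import torch
import torch.nn as nn


class SegmentAwareCrossAttention(nn.Module):
    """Cross-attention where music tokens (queries) attend only to
    video frames (keys/values) that belong to the same segment."""

    def __init__(self, music_dim: int, video_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=music_dim, kdim=video_dim, vdim=video_dim,
            num_heads=num_heads, batch_first=True)

    def forward(self, music_tokens, video_feats, music_seg_ids, video_seg_ids):
        # music_tokens: (B, Lm, music_dim)  noisy music latents at this step
        # video_feats:  (B, Lv, video_dim)  per-frame video features
        # *_seg_ids:    (B, Lm) / (B, Lv)   segment index of each token/frame
        # block[b, i, j] is True where attention is *blocked*, i.e. the music
        # token and the video frame come from different segments.
        block = music_seg_ids.unsqueeze(-1) != video_seg_ids.unsqueeze(1)
        # nn.MultiheadAttention expects a per-head mask of shape (B*H, Lm, Lv).
        block = block.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(music_tokens, video_feats, video_feats,
                           attn_mask=block)
        return out


# Tiny usage example with random tensors (all shapes are illustrative).
if __name__ == "__main__":
    B, Lm, Lv, Dm, Dv = 2, 16, 32, 256, 512
    layer = SegmentAwareCrossAttention(Dm, Dv)
    music = torch.randn(B, Lm, Dm)
    video = torch.randn(B, Lv, Dv)
    music_seg = torch.randint(0, 4, (B, Lm))
    video_seg = torch.randint(0, 4, (B, Lv))
    print(layer(music, video, music_seg, video_seg).shape)  # (2, 16, 256)
```

The key design point is the segment mask: restricting each music token's attention to the frames of its aligned shot is one plausible way to realize the sequential video-music alignment the summary describes, though the paper's actual layer may differ in detail.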