VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

6 Jun 2024 | Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
This paper presents VidMuse, a novel framework for generating music aligned with video inputs, focusing on both audio quality and semantic alignment. The authors construct a large-scale dataset, V2M, containing 190K video-music pairs from various genres such as movie trailers, advertisements, and documentaries. VidMuse integrates local and global visual cues to produce high-fidelity music that matches the video content through Long-Short-Term modeling. The core components of VidMuse include a Long-Short-Term Visual Module and a Music Token Decoder. The Long-Short-Term Visual Module captures both fine-grained and coarse-grained visual features, while the Music Token Decoder converts video embeddings into music tokens. Extensive experiments demonstrate that VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The paper also includes a user study and ablation studies to validate the effectiveness of VidMuse's design choices.
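The long-short-term idea described above can be sketched as a module with two branches: a short-term branch that looks only at a local window of recent frames (fine-grained cues) and a long-term branch that pools over the whole video (coarse-grained, global context), with the two fused into one conditioning embedding. The following is a minimal illustrative sketch, not the paper's actual implementation; the class name, layer choices, window size, and embedding dimension are all assumptions.

```python
import torch
import torch.nn as nn

class LongShortTermFusion(nn.Module):
    """Hypothetical sketch of long-short-term visual fusion.

    Short-term branch: aggregates a local window of frame embeddings.
    Long-term branch: aggregates the entire video.
    All names and sizes are illustrative, not VidMuse's implementation.
    """

    def __init__(self, dim=512, window=16):
        super().__init__()
        self.window = window
        self.local_proj = nn.Linear(dim, dim)   # fine-grained branch
        self.global_proj = nn.Linear(dim, dim)  # coarse-grained branch
        self.fuse = nn.Linear(2 * dim, dim)     # combine both cues

    def forward(self, frames):
        # frames: (T, dim) per-frame visual embeddings (e.g. from a CLIP-style encoder)
        # Short-term: mean over the most recent `window` frames
        local = self.local_proj(frames[-self.window:].mean(dim=0))
        # Long-term: mean over all frames of the video
        glob = self.global_proj(frames.mean(dim=0))
        # Fused embedding that a music token decoder could condition on
        return self.fuse(torch.cat([local, glob], dim=-1))  # (dim,)
```

In the real framework this fused conditioning would drive an autoregressive Music Token Decoder; here the output is simply a single vector summarizing both local and global visual context.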