FRIEREN: Efficient Video-to-Audio Generation with Rectified Flow Matching

9 Jul 2024 | Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao
FRIEREN is an efficient video-to-audio (V2A) generation model that uses rectified flow matching to synthesize content-matching audio from silent video. The model regresses the conditional transport vector field from noise to the spectrogram latent along straight paths and samples by solving an ordinary differential equation (ODE), outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer, together with channel-level cross-modal feature fusion that provides strong temporal alignment, FRIEREN generates audio that is highly synchronized with the input video. Additionally, through reflow and one-step distillation with a guided vector field, the model can generate decent audio in a few sampling steps, or even a single step. Experiments on the VGGSound dataset show that FRIEREN achieves state-of-the-art performance in both generation quality and temporal alignment, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over strong diffusion-based baselines. The model also demonstrates significant efficiency improvements, achieving a 7.3× speedup compared to Diff-Foley and a 9.3× acceleration in one-step generation.
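The straight-path regression target and ODE sampling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`straight_path`, `flow_matching_loss`, `euler_sample`), the Euler solver, the tensor shapes, and the toy "oracle" vector field standing in for the trained transformer estimator are all assumptions made for illustration.

```python
import numpy as np

def straight_path(x0, x1, t):
    """Interpolate on the straight line from noise x0 (t=0) to data x1 (t=1).

    Returns the point x_t and the regression target for the vector field,
    which is the constant displacement x1 - x0 along a straight path.
    """
    t = np.reshape(t, (-1,) + (1,) * (x0.ndim - 1))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    return x_t, target_v

def flow_matching_loss(v_pred, target_v):
    """Mean squared error between predicted and target vector fields."""
    return float(np.mean((v_pred - target_v) ** 2))

def euler_sample(vector_field, x0, cond, num_steps):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, cond)
    return x

# Toy data: 4 "latents" of dimension 8 (hypothetical shapes).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # noise samples
x1 = rng.standard_normal((4, 8))   # stand-in for spectrogram latents

x_t, target_v = straight_path(x0, x1, np.full(4, 0.3))

# Hypothetical oracle field: a perfectly straight flow has constant velocity.
# A trained estimator would instead predict this from (x, t, visual features).
def oracle_field(x, t, cond):
    return target_v

one_step = euler_sample(oracle_field, x0, cond=None, num_steps=1)
many_steps = euler_sample(oracle_field, x0, cond=None, num_steps=25)
```

In this idealized case the one-step and 25-step results coincide, because a constant velocity is integrated exactly by a single Euler step. This is the intuition behind using straight paths for few-step sampling; a learned vector field is only approximately straight.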