FRIEREN: Efficient Video-to-Audio Generation with Rectified Flow Matching

9 Jul 2024 | Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao
FRIEREN is an efficient video-to-audio generation model based on rectified flow matching. The model regresses the conditional transport vector field from noise to spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, the model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, the model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that FRIEREN achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
The paper introduces FRIEREN, a video-to-audio generation model based on rectified flow matching. The method regresses the conditional transport vector field between the noise and data distributions along trajectories that are as straight as possible, and samples by solving the corresponding ordinary differential equation (ODE). With this simpler formulation, the rectified-flow-based model achieves higher audio quality and diversity. To improve temporal alignment, the vector field estimator is a non-autoregressive network built on a feed-forward transformer, which preserves temporal resolution. Conditioning uses a channel-level cross-modal feature fusion mechanism that exploits the inherent alignment of audio-visual data. Together, these designs yield high synchrony between the generated audio and the input video while keeping the model simple.
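The straight-path regression and ODE sampling described above can be illustrated with a minimal sketch. This is not the paper's implementation: it uses plain Python lists in place of spectrogram latents, and the names `interpolate`, `target_velocity`, `rfm_loss`, `euler_sample`, and the toy `oracle_field` are all hypothetical. It assumes the usual rectified-flow convention in which noise sits at t = 0 and data at t = 1, and a fixed-step Euler solver.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target on a straight path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def rfm_loss(model, x0, x1, t, cond):
    """Mean squared error between predicted and straight-path velocity."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t, cond)
    tgt = target_velocity(x0, x1)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(tgt)


def euler_sample(model, x0, cond, num_steps):
    """Integrate dx/dt = model(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = model(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_field(x, t, cond):
    """Toy stand-in for a trained estimator (assumption): the data point is
    the condition itself, so the ideal field points straight at cond."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
data = [0.5, -1.0, 2.0, 0.0]
one_step = euler_sample(oracle_field, noise, data, num_steps=1)
many_steps = euler_sample(oracle_field, noise, data, num_steps=25)
```

In this toy setting the trajectory is exactly straight, so one Euler step and 25 Euler steps reach the same point. That is the intuition behind few-step sampling: the straighter the learned paths, the less a coarse solver loses.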
Moreover, by integrating reflow and one-step distillation, the model can generate decent audio in a few sampling steps, or even a single one, which significantly improves generation efficiency. FRIEREN outperforms strong baselines in audio quality, generation efficiency, and temporal alignment on VGGSound, achieving a 6.2% improvement in inception score (IS) and a generation speed 7.3× that of Diff-Foley, as well as temporal alignment accuracy of up to 97.22% in 25 steps. Additionally, FRIEREN combining reflow and distillation achieves alignment accuracy of up to 97.85% with just one step, a 9.3× acceleration over 25-step sampling. The paper also discusses related work, including video-to-audio generation and flow matching generative models. It presents the method, covering the preliminaries of rectified flow matching, the model architecture, re-weighting of the RFM objective with a logit-normal coefficient, classifier-free guidance, and reflow and one-step distillation with a guided vector field, followed by the experiments. The experiments show that FRIEREN significantly outperforms other ...
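Two of the ingredients listed in the method, the logit-normal re-weighting coefficient and classifier-free guidance on the vector field, can be sketched in a few lines. This is an illustration under assumptions rather than the paper's code: `logit_normal_weight` evaluates the standard logit-normal density, with default `mean=0.0` and `std=1.0` chosen here only for the example, and `guided_velocity` uses the common formulation that extrapolates from the unconditional prediction toward the conditional one with a hypothetical guidance `scale`.

```python
import math


def logit_normal_weight(t, mean=0.0, std=1.0):
    """Logit-normal density at t in (0, 1), usable as a per-timestep
    loss coefficient. mean and std are illustrative defaults."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    u = math.log(t / (1.0 - t))  # logit(t)
    norm = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((u - mean) ** 2) / (2.0 * std ** 2)) / (t * (1.0 - t))


def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance on velocities:
    scale * v_cond + (1 - scale) * v_uncond.
    scale = 1 recovers the conditional field, scale = 0 the unconditional."""
    return [scale * c + (1.0 - scale) * u for c, u in zip(v_cond, v_uncond)]


# With the defaults above, mid-range timesteps get more weight than the ends.
weights = {t: logit_normal_weight(t) for t in (0.05, 0.2, 0.5, 0.8, 0.95)}

# A scale above 1 pushes the velocity further along the conditional direction.
v = guided_velocity([1.0, 2.0], [0.0, 1.0], scale=2.0)
```

In a reflow or distillation stage, a guided field of this kind would be the one used to generate the noise and sample pairs that the straighter or one-step model is then trained on; the sketch shows only the combination rule itself.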