15 Jul 2024 | Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
The paper introduces Masked Generative Video-to-Audio Transformers (MaskVAT), a novel model designed to generate audio from silent videos, with a focus on enhancing temporal synchronization between the generated audio and the input video. MaskVAT combines a full-band high-quality audio codec, such as the Descript audio codec (DAC), with a sequence-to-sequence masked generative model. This integration allows for the simultaneous modeling of high audio quality, semantic matching, and temporal synchronicity. The model is trained using a variety of multi-modal audio-visual features, including pre-trained CLIP and S3D encoders, to ensure that the generated audio aligns well with the visual input. The paper evaluates MaskVAT on the VGGSound dataset and the MUSIC dataset, demonstrating superior performance in terms of audio quality, semantic matching, and temporal alignment compared to existing baselines. The results show that MaskVAT outperforms other models in both objective metrics and subjective evaluations, particularly in the alignment and overall quality categories.The paper introduces Masked Generative Video-to-Audio Transformers (MaskVAT), a novel model designed to generate audio from silent videos, with a focus on enhancing temporal synchronization between the generated audio and the input video. MaskVAT combines a full-band high-quality audio codec, such as the Descript audio codec (DAC), with a sequence-to-sequence masked generative model. This integration allows for the simultaneous modeling of high audio quality, semantic matching, and temporal synchronicity. The model is trained using a variety of multi-modal audio-visual features, including pre-trained CLIP and S3D encoders, to ensure that the generated audio aligns well with the visual input. The paper evaluates MaskVAT on the VGGSound dataset and the MUSIC dataset, demonstrating superior performance in terms of audio quality, semantic matching, and temporal alignment compared to existing baselines. The results show that MaskVAT outperforms other models in both objective metrics and subjective evaluations, particularly in the alignment and overall quality categories.