[slides] Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

MaskVAT is a video-to-audio (V2A) generative model that pairs a full-band, high-quality audio codec with a sequence-to-sequence masked generative model. It targets three goals at once: high audio quality, semantic matching, and temporal synchronicity. The model conditions on pre-trained audio-visual features and generates audio in parallel across the sequence. To keep the generated audio aligned with the visual content, it combines the sequence-to-sequence architecture with a regularization loss, pre-trained synchronicity features, and a post-sampling selection model.

MaskVAT is evaluated on the VGGSound and MUSIC datasets, where it performs strongly on both objective and subjective metrics. It is reported to outperform existing models in audio quality, semantic relevance, and temporal alignment, particularly for full-band audio generation, and to be competitive with state-of-the-art models. This effectiveness is attributed to integrating a high-quality audio codec with a masked generative approach, which lets the model produce audio that is both high in quality and temporally synchronized with the input video.
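The summary does not describe the sampling procedure itself. As a rough illustration of how masked generative models commonly decode in parallel, here is a minimal sketch assuming a cosine unmasking schedule and a random stand-in for the transformer. The names `MASK`, `toy_predictor`, and `masked_parallel_decode` are hypothetical, and nothing here should be read as MaskVAT's actual implementation.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the transformer (assumption, not the real model).

    Returns one (token, confidence) pair per position. A real model would
    condition on video features and on the already-decoded tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_parallel_decode(seq_len, vocab_size, steps=4, seed=0):
    """Fill a fully masked sequence over a few steps, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        preds = toy_predictor(tokens, vocab_size, rng)
        # Commit the highest-confidence predictions among masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    codes = masked_parallel_decode(seq_len=16, vocab_size=1024)
    print(codes)  # 16 codec-token indices, none left masked
```

The point of the schedule is that only a few positions are committed early, when the model has little context, and progressively more are committed as the sequence fills in. In a real system the resulting token indices would be passed to the audio codec's decoder to produce a waveform.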

15 Jul 2024 | Santiago Pascual, Chungsin Yeh, Ioannis Tsiamas, and Joan Serrà