Text-to-Audio Generation Synchronized with Videos


8 Mar 2024 | Shentong Mo, Jing Shi, Yapeng Tian
This paper introduces T2AV-BENCH, a new benchmark for text-to-audio (TTA) generation aligned with videos, along with three novel metrics that evaluate visual alignment and temporal consistency. The authors also propose T2AV, a simple yet effective video-aligned TTA generation model built on latent diffusion models. T2AV conditions generation on visual-aligned text embeddings, using a temporal multi-head attention transformer to extract temporal nuances from video data. It further employs an Audio-Visual ControlNet to merge temporal visual representations with text embeddings, and adds a contrastive learning objective that pulls the visual-aligned text embeddings toward the corresponding audio features.

Evaluated on the AudioCaps and T2AV-BENCH datasets, T2AV outperforms previous baselines in visual alignment and temporal consistency, generating high-fidelity audio that is well aligned with the corresponding video content. Extensive ablation studies validate the importance of visual-aligned language-audio pre-training and the Audio-Visual ControlNet for learning temporal-aware representations that maintain visual alignment and temporal consistency. The authors conclude that their approach significantly improves the quality of video-aligned TTA generation.
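
To make the two visual components concrete, here is a minimal PyTorch sketch of a temporal multi-head attention transformer over per-frame video features and a ControlNet-style fusion module that injects the resulting temporal signal into the text conditioning. This is an illustrative reconstruction, not the paper's implementation; all class names, dimensions, and the zero-initialization choice are assumptions.

```python
import torch
import torch.nn as nn

class TemporalVideoEncoder(nn.Module):
    """Self-attention over per-frame video features (batch, frames, dim),
    so the model can pick up temporal cues such as onsets and motion.
    (Hypothetical sketch; hyperparameters are placeholders.)"""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # Attention runs across the time axis; the output keeps one
        # temporally contextualized embedding per frame.
        return self.temporal(frame_feats)


class AudioVisualControl(nn.Module):
    """ControlNet-style fusion: project the temporal visual features
    through a zero-initialized layer and add them onto the text
    conditioning, so training starts from the unmodified TTA model."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init: a no-op at step 0
        nn.init.zeros_(self.proj.bias)

    def forward(self, text_cond: torch.Tensor,
                visual_feats: torch.Tensor) -> torch.Tensor:
        # text_cond: (batch, tokens, dim) conditioning for the diffusion U-Net.
        # visual_feats: (batch, frames, dim) from TemporalVideoEncoder.
        control = self.proj(visual_feats.mean(dim=1, keepdim=True))
        return text_cond + control  # broadcast over the token axis
```

Under these assumptions, the latent diffusion U-Net would be conditioned on `AudioVisualControl()(text_cond, TemporalVideoEncoder()(frame_feats))` rather than on the raw text embeddings alone.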
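
The contrastive objective that matches visual-aligned text embeddings to audio features can likewise be sketched as a standard symmetric InfoNCE loss. The function name and temperature are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               audio_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: each visual-aligned text embedding is pulled
    toward its paired audio embedding and pushed away from the other
    pairs in the batch. Both inputs are pooled (batch, dim) embeddings."""
    # Normalize so dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs.
    logits = text_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the text-to-audio and audio-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```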