Understanding Read, Watch and Scream! Sound Generation from Text and Video

ReWaS is a video-and-text-to-sound generation method that adds video as a conditional control to a text-to-audio generation model. It estimates the structural information of the audio (its energy) from the video, while the key content cues come from a user prompt. Rather than training a multimodal diffusion model from scratch on massive triplet-paired (audio-video-text) data, ReWaS builds on a well-performing text-to-sound model and attaches the video control to it, which is much more efficient to train. Separating the generative components of audio also makes the system more flexible: users can adjust the energy, the surrounding environment, and the primary sound source according to their preferences.

The approach is based on AudioLDM, a state-of-the-art text-to-audio generation method, and introduces an energy adapter inspired by ControlNet to enable efficient training and control. The energy control is derived from the video and used as a condition in the diffusion process to generate the corresponding audio. The method is trained on the VGGSound dataset and fine-tuned on the GreatestHits dataset to improve performance on these datasets.

The method is evaluated on both datasets using quantitative metrics such as FID, IS, and MKL, as well as qualitative assessments and human evaluations. The results show that ReWaS outperforms existing methods in quality, controllability, and training efficiency, with better audio generation, temporal alignment, and relevance to the video across all categories. It generates realistic audio that aligns with both text and video inputs and handles complex scenarios such as short transitions in skateboarding videos. It is also effective for videos with ambiguous or redundant frames, and can be used for creating SFX, for post-production in filmmaking, and for adding sound to AI-generated silent videos.
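
To make the idea of an energy control signal concrete, the following is a minimal sketch of how a per-frame energy curve could be computed from a mel-spectrogram. The summary above does not define the energy, so the definition here (mean magnitude over frequency bins per frame, followed by smoothing and normalization) is an assumption, and the function names `frame_energy`, `smooth`, and `normalize` are hypothetical. In ReWaS the energy is estimated from video at generation time; this sketch only illustrates what such a curve might look like on the audio side.

```python
from typing import List, Sequence


def frame_energy(mel: Sequence[Sequence[float]]) -> List[float]:
    """Assumed definition: mean magnitude over frequency bins per time frame."""
    return [sum(frame) / len(frame) for frame in mel]


def smooth(signal: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; at the edges, average only the values that exist."""
    half = window // 2
    out = []
    for t in range(len(signal)):
        lo, hi = max(0, t - half), min(len(signal), t + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def normalize(signal: Sequence[float]) -> List[float]:
    """Min-max scale to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0] * len(signal)
    return [(v - lo) / (hi - lo) for v in signal]


# Toy "mel-spectrogram": 5 time frames x 4 frequency bins,
# with a loud event (e.g. a hit) in the middle frame.
toy_mel = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]

energy = normalize(smooth(frame_energy(toy_mel)))
peak_frame = energy.index(max(energy))
print(peak_frame, [round(v, 3) for v in energy])
```

A one-dimensional curve like this carries only when and how strongly something sounds, not what it is, which matches the division of labor described above: timing and intensity from the video, content from the text prompt.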

Read, Watch and Scream! Sound Generation from Text and Video

8 Jul 2024 | Yujin Jeong*, Yunji Kim, Sanghyuk Chun, Jiyoung Lee†