[slides] FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

FoleyCrafter is a framework for generating high-quality sound effects that are aligned with silent videos. It builds on a pre-trained text-to-audio model and adds two components: a semantic adapter for semantic alignment and a temporal controller for audio-video synchronization. The semantic adapter uses parallel cross-attention layers to condition audio generation on video features, so that the generated sound effects are realistic and semantically relevant to the visual content. The temporal controller combines an onset detector with a timestamp-based adapter to align the audio with the video in time. Both components are trained on paired video-audio data, while the text-to-audio base model is kept fixed to preserve its established audio generation quality. FoleyCrafter remains compatible with text prompts, which allows controllable and diverse video-to-audio generation according to user intent. On standard benchmarks, the framework achieves state-of-the-art performance in audio quality and video alignment. It was also tested on several video genres, including realistic videos, games, and animations. In the user study, FoleyCrafter was preferred on all three metrics: semantic alignment, temporal alignment, and generation quality. The work was supported by the National Key R&D Program of China and partially by NSFC and the Shenzhen Science and Technology Program.
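To make the idea of parallel cross-attention more concrete, here is a minimal sketch in plain Python. It is not FoleyCrafter's implementation: the function names, the `video_weight` parameter, and the choice to sum the two branches are assumptions made for illustration, and the learned query/key/value projections of a real attention layer are omitted (tokens are used directly as keys and values).

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention over lists of vectors.
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return outputs

def parallel_cross_attention(audio_latents, text_tokens, video_tokens,
                             video_weight=1.0):
    # Hypothetical combination rule: the same audio queries attend to the
    # text tokens and to the video tokens in two separate branches, and the
    # video branch is added to the text branch with a scalar weight.
    text_out = cross_attention(audio_latents, text_tokens, text_tokens)
    video_out = cross_attention(audio_latents, video_tokens, video_tokens)
    return [
        [t + video_weight * v for t, v in zip(text_row, video_row)]
        for text_row, video_row in zip(text_out, video_out)
    ]

# Toy usage with one audio latent, one text token, and one video token.
mixed = parallel_cross_attention([[1.0, 0.0]], [[2.0, 3.0]], [[10.0, 20.0]],
                                 video_weight=0.5)
```

In this sketch, setting `video_weight` to 0 reduces the layer to text-only cross-attention, which is one way to see how a video branch could be added alongside a fixed text-to-audio model without changing its original text pathway.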
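The temporal controller's pairing of an onset detector with a timestamp-based adapter can be illustrated with a small sketch of the data that might flow between them. Everything here is an assumption for illustration: the per-frame onset probabilities, the `threshold` value, and the binary mask format are hypothetical stand-ins, not the representation FoleyCrafter actually uses.

```python
def onset_timestamps(onset_probs, fps, threshold=0.5):
    # Return the times (in seconds) at which the per-frame onset
    # probability rises from below the threshold to at or above it.
    times = []
    previously_active = False
    for frame_index, prob in enumerate(onset_probs):
        active = prob >= threshold
        if active and not previously_active:
            times.append(frame_index / fps)
        previously_active = active
    return times

def timestamp_mask(times, duration, steps):
    # Binary mask over `steps` audio time steps, with 1 at every step
    # that contains an onset. Such a mask is one plausible conditioning
    # signal for a timestamp-based adapter.
    mask = [0] * steps
    for t in times:
        index = min(steps - 1, int(t * steps / duration))
        mask[index] = 1
    return mask

# Toy usage: six video frames at 2 frames per second (3 seconds of video).
frame_probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
times = onset_timestamps(frame_probs, fps=2)
mask = timestamp_mask(times, duration=3.0, steps=6)
```

Here the two rising edges in `frame_probs` yield onsets at 0.5 s and 2.0 s, and the mask marks the corresponding audio steps, so a generator conditioned on it would know where sound events should start.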

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

1 Jul 2024 | Yiming Zhang¹, Yicheng Gu², Yanhong Zeng¹,‡, Zhening Xing¹, Yuancheng Wang², Zhizheng Wu²,¹, Kai Chen¹,‡