10 Mar 2024 | Deshun Yang*, Luhui Hu*, Yu Tian*, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou
WorldGPT is a video generation AI agent inspired by Sora, capable of creating high-quality videos from text and image inputs. The system integrates a prompt enhancer and a full video translation framework. The prompt enhancer uses ChatGPT to refine the user's input and generate a precise prompt for each step of the pipeline, so that every stage receives clear instructions. The full video translation component uses diffusion techniques to generate and refine key frames, which are then used to create videos with improved temporal consistency and smooth motion.
The system addresses challenges such as temporal-spatial consistency, diversity and creativity, and video inference evaluation. It leverages dynamic scene modeling, spatiotemporal prediction, and multimodal fusion to generate diverse video sequences from text and image inputs. The framework consists of a key frame generator and a video generator. The key frame generator uses GroundingDino and Stable Diffusion to create key frames based on text prompts.
The video generator uses DynamiCrafter to generate a seamless sequence of frames between the provided starting and ending frames. It then refines the video using background details and requirements generated by ChatGPT. This workflow carries a concept through to a finished video, supporting both a preset ending and individual expression.
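The three stages described above (prompt enhancement, key frame generation, and video generation between a starting and an ending frame) can be sketched as a simple function composition. This is only an illustrative outline, not WorldGPT's actual code: every name here (`StagePrompts`, `run_pipeline`, the `fake_*` stubs) is hypothetical, frames are represented as plain strings, and the sketch assumes that the input image serves as the starting frame and that the ending key frame is derived from it. In a real system the three callables would wrap ChatGPT, GroundingDino with Stable Diffusion, and DynamiCrafter respectively.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in type: a real system would pass image tensors, not strings.
Frame = str


@dataclass
class StagePrompts:
    """Per-stage prompts produced by the (hypothetical) prompt enhancer."""
    keyframe_prompt: str  # instruction for producing the ending key frame
    video_prompt: str     # background details and requirements for the video stage


def run_pipeline(
    user_text: str,
    start_frame: Frame,
    enhance: Callable[[str], StagePrompts],
    make_end_frame: Callable[[Frame, str], Frame],
    interpolate: Callable[[Frame, Frame, str], List[Frame]],
) -> List[Frame]:
    """Chain the three stages: enhance prompts, build the end frame, fill in the video."""
    prompts = enhance(user_text)
    end_frame = make_end_frame(start_frame, prompts.keyframe_prompt)
    return interpolate(start_frame, end_frame, prompts.video_prompt)


# Stub stages so the sketch runs without any model; all behavior is made up.
def fake_enhance(text: str) -> StagePrompts:
    return StagePrompts(f"edit: {text}", f"animate: {text}")


def fake_end_frame(frame: Frame, prompt: str) -> Frame:
    return f"{frame}+[{prompt}]"


def fake_interpolate(start: Frame, end: Frame, prompt: str, n: int = 4) -> List[Frame]:
    # Keep the given start and end frames and place n - 2 placeholders between them.
    return [start] + [f"mid{i}" for i in range(1, n - 1)] + [end]


frames = run_pipeline("a cat jumps", "img0", fake_enhance, fake_end_frame, fake_interpolate)
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so each stage can be swapped or tested in isolation.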
Experimental results show that WorldGPT outperforms other state-of-the-art models in terms of control-video alignment, motion effects, and temporal consistency. It also demonstrates superior performance in handling complex textual inputs and generating videos that reflect the textual input accurately. Human evaluation further confirms that WorldGPT provides comparable visual quality to other models but excels in motion quality and text-video alignment. The system's controllability and adaptability make it suitable for various applications, including automated content creation and media storytelling.