22 Mar 2024 | Zhengqing Yuan¹*, Ruoxi Chen¹*, Zhaoxu Li¹*, Haolong Jia¹*, Lifang He¹, Chi Wang², Lichao Sun¹
Mora is a novel multi-agent framework designed to enable generalist video generation, inspired by OpenAI's Sora. Unlike Sora, which is closed-source, Mora is open-source and aims to replicate Sora's capabilities across a range of video generation tasks. The framework integrates multiple advanced visual AI agents to handle different aspects of video generation, including text-to-video, text-conditional image-to-video, video extension, video editing, video connection, and digital world simulation. Mora's performance is evaluated across a series of tasks and metrics, showing that it achieves results comparable or superior to Sora's in many areas. However, performance gaps remain, particularly in video quality and length and in interpreting specific motion instructions. The paper discusses Mora's strengths and limitations, emphasizing its flexibility, adaptability, and open-source contribution, while also highlighting the need for improved video datasets and advances in rendering capabilities.
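To make the multi-agent composition concrete, the sketch below shows one way such a framework could chain specialized agents, with each agent transforming one intermediate artifact into the next (e.g., prompt → keyframe image → video clip). This is a minimal illustration only: the `Artifact` type, the agent names (`prompt_enhancer`, `text_to_image`, `image_to_video`), and the three-stage decomposition of text-to-video are hypothetical assumptions, not Mora's actual API or pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Artifact:
    """An intermediate result passed between agents."""
    kind: str      # e.g. "prompt", "image", "video" (hypothetical labels)
    payload: object

def prompt_enhancer(a: Artifact) -> Artifact:
    # Stand-in for an LLM agent that enriches the user's prompt.
    return Artifact("prompt", f"{a.payload} (detailed, cinematic)")

def text_to_image(a: Artifact) -> Artifact:
    # Stand-in for a text-to-image agent that produces a keyframe.
    return Artifact("image", f"<keyframe for: {a.payload}>")

def image_to_video(a: Artifact) -> Artifact:
    # Stand-in for an image-to-video agent that animates the keyframe.
    return Artifact("video", f"<clip from {a.payload}>")

def run_pipeline(agents: List[Callable[[Artifact], Artifact]],
                 task: Artifact) -> Artifact:
    # The orchestrator chains agents, feeding each output to the next.
    for agent in agents:
        task = agent(task)
    return task

if __name__ == "__main__":
    # A text-to-video request decomposed into three agent calls.
    text_to_video = [prompt_enhancer, text_to_image, image_to_video]
    result = run_pipeline(text_to_video, Artifact("prompt", "a boat sailing at dawn"))
    print(result.kind, result.payload)
```

Other tasks listed in the abstract (video editing, video connection, and so on) would, under this reading, correspond to different agent orderings over the same interface, which is what gives a multi-agent design its flexibility relative to a single monolithic model.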