[slides] Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Vidu is a high-performance text-to-video generator that can produce 1080p videos of up to 16 seconds in a single generation. It is a diffusion model with U-ViT as its backbone, which gives it scalability and the ability to handle long videos. Vidu exhibits strong coherence and dynamism, generates both realistic and imaginative videos, and understands some professional photography techniques. In a comparison with Sora, the most powerful reported text-to-video generator, Vidu's generation performance is found to be comparable. Beyond text-to-video, Vidu supports other forms of controllable video generation, including canny-to-video generation, video prediction, and subject-driven generation, all of which show promising results.

Vidu is trained on vast amounts of text-video pairs, and having humans label all of these videos is infeasible. Instead, a high-performance video captioner is trained and used to annotate the training videos automatically. During inference, a re-captioning technique rephrases user inputs into a form better suited to the model.

Vidu can generate videos of different lengths and demonstrates several specific capabilities. Its videos are 3D consistent: as the camera rotates, the video presents projections of the same object from different angles. It can generate cuts (Figure 4), which present different perspectives of the same scene by switching camera angles while keeping the subjects in the scene consistent. It can generate transitions within a single generation (Figure 5), connecting two different scenes in an engaging manner. It supports camera movements including zoom, pan, and dolly. Its lighting effects help enhance the overall atmosphere; for example, the videos in Figure 7 evoke atmospheres of mystery and tranquility, so beyond the entities in the video content, Vidu has a preliminary ability to convey some abstract feelings. It can portray characters' emotions effectively (Figure 8), including happiness, loneliness, embarrassment, and joy. Finally, it can generate imaginative scenes that do not exist in the real world (Figure 9).
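The automatic captioning and re-captioning steps described above can be pictured as two small data-flow stages. The following is a minimal sketch under stated assumptions, not Vidu's implementation: the summary gives no interfaces, so every name here (`annotate_videos`, `prepare_prompt`, `toy_captioner`, `toy_recaptioner`) is hypothetical, and the toy functions stand in for learned models.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases: the source does not specify any interfaces.
Video = str  # stand-in for a video handle, e.g. a file path
Captioner = Callable[[Video], str]
Recaptioner = Callable[[str], str]


def annotate_videos(videos: Iterable[Video], captioner: Captioner) -> List[Tuple[str, Video]]:
    """Build (caption, video) training pairs by captioning each video automatically."""
    return [(captioner(video), video) for video in videos]


def prepare_prompt(user_prompt: str, recaptioner: Recaptioner) -> str:
    """Rephrase a user prompt before it is passed to the generator."""
    cleaned = " ".join(user_prompt.split())  # collapse stray whitespace
    return recaptioner(cleaned)


# Toy stand-ins so the sketch runs; a real system would use learned models here.
def toy_captioner(video: Video) -> str:
    return f"a video clip stored at {video}"


def toy_recaptioner(prompt: str) -> str:
    return f"A detailed shot of {prompt}."


# Training time: label videos without human annotators.
pairs = annotate_videos(["clip_001.mp4", "clip_002.mp4"], toy_captioner)

# Inference time: rewrite the user's input into the style the model was trained on.
prompt = prepare_prompt("  a panda   playing guitar ", toy_recaptioner)
```

The design point the sketch tries to capture is that the same caption style is used on both sides: the captioner defines the text distribution seen in training, and the re-captioner maps free-form user input toward that distribution at inference.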

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

7 May 2024 | Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu