11 Apr 2024 | Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang Wang
DriveDreamer-2 is a world model that generates user-customized driving videos by integrating a Large Language Model (LLM) to create driving scenarios from user descriptions. Its customized traffic simulation pipeline first converts the user's text prompt into agent trajectories and then generates HDMaps that adhere to traffic regulations, with a diffusion-based HDMap generator simulating the road structures around those trajectories. The Unified Multi-View Video Model (UniMVM) is introduced to enhance temporal and spatial coherence and to keep the generated videos consistent across camera views. DriveDreamer-2 produces high-quality driving videos with an FID of 11.2 and an FVD of 55.7, relative improvements of roughly 30% and 50% over previous methods, and the generated videos improve the training of driving perception methods such as 3D detection and tracking. The model can also generate uncommon driving scenarios, such as a vehicle abruptly cutting in, demonstrating its ability to create diverse and realistic driving data. Extensive experiments validate the framework through quantitative metrics and visual comparisons with other methods.
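As a rough illustration of the three-stage pipeline described above (text prompt → agent trajectories → HDMap → multi-view video), the sketch below wires the stages together in Python. All names here (AgentTrajectory, text_to_trajectories, trajectories_to_hdmap, unimvm_generate) are hypothetical stand-ins for the components the paper describes, not the authors' released API.

```python
# Illustrative sketch only: the functions below are placeholders for the
# DriveDreamer-2 stages summarized above, not the actual implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AgentTrajectory:
    agent_type: str                    # e.g. "car", "pedestrian"
    waypoints: List[Tuple[float, float]]  # (x, y) positions over time


def text_to_trajectories(prompt: str) -> List[AgentTrajectory]:
    """Stage 1 (assumed interface): an LLM converts the user's text prompt
    into per-agent trajectories, e.g. a vehicle cutting in ahead of ego."""
    # Placeholder output; the paper uses a finetuned LLM for this step.
    return [AgentTrajectory("car", [(0.0, 0.0), (1.5, 0.2), (3.0, 1.0)])]


def trajectories_to_hdmap(trajectories: List[AgentTrajectory]) -> dict:
    """Stage 2 (assumed interface): a diffusion-based HDMap generator
    produces road structure consistent with the trajectories and with
    traffic regulations."""
    return {"lanes": [], "crosswalks": []}  # simplified placeholder


def unimvm_generate(hdmap: dict, trajectories: List[AgentTrajectory],
                    num_views: int = 6, num_frames: int = 16) -> list:
    """Stage 3 (assumed interface): UniMVM renders temporally and spatially
    coherent multi-view video conditioned on the structured inputs."""
    return [[f"view{v}_frame{t}" for t in range(num_frames)]
            for v in range(num_views)]


if __name__ == "__main__":
    trajs = text_to_trajectories("a car abruptly cuts in from the right lane")
    hdmap = trajectories_to_hdmap(trajs)
    video = unimvm_generate(hdmap, trajs)
    print(f"generated {len(video)} camera views x {len(video[0])} frames")
```

The point of the sketch is the data flow: free-form text is progressively grounded into structured conditions (trajectories, then HDMaps) before the video model is invoked, which is what lets the generated multi-view videos stay consistent with both the user's request and traffic rules.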