Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

6 Jun 2024 | Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, Di Lin, Kaicheng Yu
The paper introduces *Delphi*, a diffusion-based long video generation method designed to address the challenges of spatial-temporal consistency and precise controllability when generating multi-view videos for end-to-end autonomous driving models. *Delphi* can generate videos of up to 40 frames, significantly longer than existing methods, which typically produce only 8 frames. The method includes a Noise Reinitialization Module (NRM) that models noise shared across frames and a Feature-aligned Temporal Consistency Module (FTCM) that enforces spatial and temporal consistency.

To enhance the generalization of end-to-end models, the paper further proposes a failure-case driven framework. This framework evaluates the end-to-end model, collects its failure cases, analyzes their data patterns with pre-trained visual language models, retrieves similar scenes from the training data, generates diverse training data with *Delphi*, and updates the end-to-end model, as sketched below.

Extensive experiments on the nuScenes dataset demonstrate that *Delphi* generates high-quality long multi-view videos with spatiotemporal consistency and precise controllability. The failure-case driven framework improves the end-to-end model's planning performance by 25% while adding synthetic data amounting to only 4% of the training-set size, making it the first method to boost planning performance, beyond perception tasks, in end-to-end autonomous driving.
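As a rough illustration of the shared-noise idea behind the NRM, the sketch below mixes a single noise tensor reused by every frame and view with per-frame independent noise, so all frames start denoising from correlated latents. This is a generic mixed-noise construction written for this summary, not the paper's exact formulation; the blending weight `alpha` and the latent shape are illustrative assumptions.

```python
import torch

def sample_mixed_noise(num_frames, num_views, shape, alpha=0.5, device="cpu"):
    """Sample initial diffusion noise with a component shared across frames.

    Generic mixed-noise sketch (not Delphi's exact NRM): every frame/view
    receives the same `shared` tensor plus its own independent noise, so
    consecutive frames begin denoising from correlated latents, which
    encourages temporal consistency. `alpha` controls the correlation strength.
    """
    shared = torch.randn(1, 1, *shape, device=device)            # one draw reused by all frames/views
    independent = torch.randn(num_frames, num_views, *shape, device=device)
    # Combining in variance keeps the result a unit-variance Gaussian.
    noise = (alpha ** 0.5) * shared + ((1.0 - alpha) ** 0.5) * independent
    return noise  # (num_frames, num_views, C, H, W)

# Example: initial latents for a 40-frame, 6-camera clip (latent shape is assumed).
latents = sample_mixed_noise(num_frames=40, num_views=6, shape=(4, 28, 50))
```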
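The failure-case driven framework can likewise be summarized as a closed loop. The sketch below is a hypothetical outline of the steps listed above; every callable it takes (`evaluate`, `describe`, `retrieve`, `generate`, `finetune`) is a placeholder standing in for a component named in the paper, not the authors' actual API.

```python
from typing import Callable, Sequence

def failure_case_driven_update(
    e2e_model,
    train_set: Sequence,
    evaluate: Callable,   # runs the planner, returns the scenes it fails on
    describe: Callable,   # pre-trained VLM: failure scene -> text pattern
    retrieve: Callable,   # (train_set, patterns) -> similar seed scenes
    generate: Callable,   # Delphi: (seed scene, prompt) -> synthetic multi-view clip
    finetune: Callable,   # (model, data) -> updated model
    budget: float = 0.04, # synthetic data capped at ~4% of the training-set size
):
    """Hypothetical sketch of the failure-case-driven data engine; the callables
    are placeholders, not the paper's implementation."""
    failures = evaluate(e2e_model)                        # 1. collect failure cases
    patterns = [describe(case) for case in failures]      # 2. mine data patterns with a VLM
    seeds = retrieve(train_set, patterns)                 # 3. retrieve similar training scenes
    synthetic = [generate(s, p) for s, p in zip(seeds, patterns)]  # 4. generate diverse clips
    synthetic = synthetic[: int(budget * len(train_set))]
    return finetune(e2e_model, list(train_set) + synthetic)        # 5. update the e2e model
```

In practice such a loop could be run for several rounds, with each round targeting whatever failure modes remain after the previous update.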