Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation


6 Jun 2024 | Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Di Lin, Kaicheng Yu
This paper introduces Delphi, a diffusion-based method for generating long multi-view videos in autonomous driving scenarios. Delphi can generate up to 40 frames of temporally consistent multi-view video, roughly five times longer than prior state-of-the-art methods. To achieve this, it introduces a shared noise modeling mechanism across camera views to enhance spatial consistency, and a feature-aligned module to achieve precise controllability and temporal consistency.

Delphi is paired with a failure-case-driven framework that automatically improves the generalization of end-to-end driving models by generating new data resembling their failure cases. The framework leverages a pre-trained vision-language model to analyze the implicit patterns in failure cases and retrieve similar samples from the existing training data, which Delphi then uses to synthesize diverse new training scenes.

Evaluated on the nuScenes dataset, Delphi generates videos with higher quality and controllability than existing methods, exhibiting strong spatiotemporal consistency. Combined with the failure-case-driven framework, it improves the planning performance of an end-to-end model by 25% while generating data equivalent to only 4% of the training set, showing that the framework enhances generalization at low cost. The paper concludes that Delphi is a promising approach for improving end-to-end autonomous driving through high-quality, controllable long multi-view video generation.
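The summary does not spell out how shared noise modeling is implemented. Below is a minimal sketch of one plausible interpretation, in which each camera view's diffusion noise is a blend of a component shared across all views and an independent per-view residual; the mixing weight `alpha`, the tensor shapes, and the function name are illustrative assumptions, not details from the paper.

```python
import torch

def sample_shared_noise(num_views: int, frames: int, c: int, h: int, w: int,
                        alpha: float = 0.5, generator=None) -> torch.Tensor:
    """Sample diffusion noise for a multi-view video clip.

    Each view's noise mixes a component shared by all views (encouraging
    cross-view spatial consistency) with an independent per-view residual.
    `alpha` controls how much of the noise is shared.
    """
    # One noise tensor shared by every camera view.
    shared = torch.randn(1, frames, c, h, w, generator=generator)
    # Independent noise for each view.
    independent = torch.randn(num_views, frames, c, h, w, generator=generator)
    # Blend so the result keeps approximately unit variance.
    return alpha**0.5 * shared + (1.0 - alpha)**0.5 * independent

# Example: 6 surround-view cameras, 40 latent frames, 4x32x32 latents.
noise = sample_shared_noise(num_views=6, frames=40, c=4, h=32, w=32)
print(noise.shape)  # torch.Size([6, 40, 4, 32, 32])
```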
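Likewise, the failure-case-driven retrieval step is only described at a high level. The sketch below shows one way such retrieval could work, embedding failure-case frames and training frames with CLIP and ranking training frames by cosine similarity; the choice of CLIP, the similarity measure, and the function names are assumptions, since the paper only states that a pre-trained vision-language model analyzes failure cases and retrieves similar data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed components: CLIP as the pre-trained vision-language model and
# cosine similarity for retrieval; the summary does not specify either.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve_similar(failure_paths, train_paths, top_k=10):
    """Return indices of training frames most similar to any failure case."""
    failure_emb = embed_images(failure_paths)   # (F, D)
    train_emb = embed_images(train_paths)       # (N, D)
    sims = failure_emb @ train_emb.T            # cosine similarities, (F, N)
    # Score each training frame by its best match to any failure case.
    scores = sims.max(dim=0).values
    return scores.topk(min(top_k, len(train_paths))).indices.tolist()
```

The retrieved scenes would then serve as conditioning material for Delphi to generate new, similar training data.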