21 Aug 2024 | Haoxiang Guo, Zhongruo Wang, Yaqian Li, Kaiwen Long, Ming Yang, Yiqing Shen
This survey explores the application of foundation models in autonomous driving (AD). It reviews over 40 research papers, highlighting the role of foundation models in enhancing AD systems. Large language models (LLMs) contribute to planning and simulation in AD, particularly through their reasoning and code generation capabilities. Vision foundation models are increasingly used for tasks like 3D object detection and creating realistic driving scenarios. Multi-modal foundation models, integrating diverse inputs, offer superior visual understanding and spatial reasoning, crucial for end-to-end AD. The survey provides a structured taxonomy of foundation models based on their modalities and functions in AD, and discusses current research methods. It identifies gaps between existing foundation models and cutting-edge AD approaches, suggesting future research directions and a roadmap to bridge these gaps.
LLMs are used in AD for reasoning and planning, prediction, user interface and personalization, and simulation and testing. They can interpret environmental cues to make safe driving decisions, predict traffic participants' trajectories, understand and execute user commands, and generate realistic driving scenarios for simulation. Techniques like prompt engineering, in-context learning, and reinforcement learning from human feedback are used to adapt LLMs for AD. However, LLMs face challenges such as hallucination, latency, and dependency on perception systems.
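To make the adaptation techniques concrete, the sketch below shows one way prompt engineering and in-context learning could be combined to query a general-purpose LLM for a high-level driving decision. It is a minimal illustration rather than a method from any specific paper in the survey; the scene-description format, the label set, the propose_maneuver helper, and the use of the OpenAI chat API are all assumptions.

```python
# Minimal sketch: prompting a general-purpose LLM for a high-level driving
# decision. The scene format, few-shot examples, and helper names are
# illustrative assumptions, not an interface from the surveyed papers.
from openai import OpenAI  # assumes the OpenAI Python SDK (>= 1.0)

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a cautious driving planner. Given a textual scene description, "
    "reply with exactly one maneuver from: KEEP_LANE, SLOW_DOWN, STOP, "
    "CHANGE_LANE_LEFT, CHANGE_LANE_RIGHT, followed by a one-sentence reason."
)

# In-context examples: a few (scene, decision) pairs steer the model's output
# format and risk tolerance without any fine-tuning.
FEW_SHOT = [
    {"role": "user", "content": "Ego at 50 km/h; pedestrian stepping onto a crosswalk 20 m ahead."},
    {"role": "assistant", "content": "STOP - a pedestrian is entering the crosswalk directly ahead."},
    {"role": "user", "content": "Ego at 90 km/h on a highway; lead vehicle 80 m ahead at similar speed."},
    {"role": "assistant", "content": "KEEP_LANE - the gap to the lead vehicle is stable and safe."},
]

def propose_maneuver(scene_description: str) -> str:
    """Query the LLM with the engineered system prompt plus in-context examples."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                {"role": "user", "content": scene_description}]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

if __name__ == "__main__":
    # The scene text would normally come from an upstream perception module,
    # which is exactly the dependency (and latency source) noted above.
    print(propose_maneuver("Ego at 40 km/h; traffic light 30 m ahead just turned yellow."))
```

In a real system the free-text reply would still need to be checked against a rule-based safety layer before execution, which is one way the hallucination and latency concerns mentioned above surface in practice.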
Vision foundation models such as DINO and the Segment Anything Model (SAM) are used for object detection and segmentation; vision foundation models are also applied to 3D perception, to generating realistic driving scenes, and to video generation. Multi-modal foundation models, such as CLIP and LLaVA, combine visual and textual information for stronger understanding and reasoning; in AD they support visual understanding, unified perception and planning, and other tasks. However, they still face challenges such as hallucination and limited 3D perception.
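As a concrete illustration of how a multi-modal foundation model can be reused for driving-related visual understanding without task-specific training, the sketch below performs zero-shot classification of a camera frame against driving-relevant text prompts with CLIP. It assumes the Hugging Face transformers interface; the label list and image path are placeholders.

```python
# Minimal sketch: zero-shot recognition of driving-relevant concepts with CLIP,
# via the Hugging Face `transformers` interface. The label set and image path
# are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text prompts describing concepts an AD perception stack might care about.
labels = [
    "a pedestrian crossing the road",
    "a cyclist riding beside traffic",
    "a car braking with its lights on",
    "an empty road with no traffic",
]

image = Image.open("driving_frame.jpg")  # placeholder path for a camera frame

# Encode image and text jointly; CLIP scores image-text similarity.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{p:.2f}  {label}")
```

Because CLIP reasons only over 2D image-text similarity, a sketch like this also makes the limited 3D perception noted above tangible: depth and metric geometry have to come from elsewhere in the stack.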
The survey concludes that foundation models have significant potential in AD but require further research to address challenges like hallucination, latency, and domain gaps. Future directions include domain-specific pre-training, reinforcement learning, and improving 3D perception. The survey also highlights the need for large-scale datasets to improve the performance of foundation models in AD.