5 Jun 2024 | Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, Joyce Chai
DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
**Abstract:**
Recent advances in foundation models (FMs) have opened new avenues for autonomous driving, but current experimental settings are often oversimplified and fail to capture the complexity of real-world driving scenarios. This paper introduces DriVLMe, a video-language-model-based agent designed to support natural and effective communication between humans and autonomous vehicles. DriVLMe is trained on both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe achieves competitive performance in open-loop benchmarks and closed-loop human studies, the study also surfaces several limitations, including unacceptable inference time, imbalanced training data, limited visual understanding, difficulty with multi-turn interactions, simplified language generation, and trouble handling unexpected situations.
**Introduction:**
Autonomous driving (AD) has made significant progress, yet effective human-agent dialogue and collaboration remain essential for passenger safety, trust, and a better driving experience. Traditional rule-based approaches struggle with the complexity of natural language, while data-driven, learning-based approaches show promise. Foundation models such as LLMs have demonstrated potential in autonomous driving, but their experimental setups are often preliminary and simplified. DriVLMe aims to address these limitations by leveraging both embodied and social experiences.
**Method:**
DriVLMe is a large video-language model consisting of a video tokenizer, a route-planning module, and a large language model (LLM) backbone. The model is trained with domain video instruction tuning followed by social and embodied instruction tuning. The video tokenizer processes visual observations, the route planner assists in finding the shortest path to the current goal, and the LLM decoder processes these inputs and generates dialogue responses and physical actions.
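To make the composition of these three components concrete, below is a minimal sketch of how such an agent's inference pass could be wired together. It is not the authors' implementation: the class and argument names, the prompt format, and the HuggingFace-style `tokenizer`/`generate` interface are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class VideoLanguageDrivingAgent(nn.Module):
    """Illustrative composition of the three components named above:
    a video tokenizer, a route planner, and an LLM decoder backbone."""

    def __init__(self, video_encoder, projector, llm, tokenizer):
        super().__init__()
        self.video_encoder = video_encoder  # turns a clip into a sequence of visual features
        self.projector = projector          # maps visual features into the LLM embedding space
        self.llm = llm                      # causal LM decoder (HuggingFace-style interface assumed)
        self.tokenizer = tokenizer

    @torch.no_grad()
    def respond(self, frames, route_hint, dialogue_history, max_new_tokens=64):
        # 1) Tokenize the visual observation into visual embeddings of shape (1, T, d_model).
        visual_tokens = self.projector(self.video_encoder(frames))

        # 2) Serialize the route-planner output and dialogue history into a text prompt.
        prompt = f"Route: {route_hint}\nDialogue: {dialogue_history}\nAgent:"
        text_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        text_embeds = self.llm.get_input_embeddings()(text_ids)

        # 3) Concatenate visual and text embeddings and decode a reply that
        #    interleaves a natural-language response with a physical action.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        out_ids = self.llm.generate(inputs_embeds=inputs_embeds,
                                    max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

The key design point this sketch illustrates is that the route planner's symbolic output and the dialogue history enter the LLM as text, while the video enters as projected embeddings, so a single decoder can produce both the reply and the action.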
**Evaluation:**
DriVLMe is evaluated on the Situated Dialogue Navigation (SDN) benchmark and the BDD-X dataset. Open-loop evaluations show significant improvements over baselines in dialogue response generation and physical action prediction. Closed-loop evaluations in CARLA simulations demonstrate robustness under various dynamic scenarios, including weather changes, goal changes, and added obstacles. However, challenges remain in multi-turn interactions, handling unexpected situations, and generating complex language responses.
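For intuition, here is a simplified sketch of what an open-loop pass over such a benchmark could look like, assuming each example supplies a video clip, a route hint, the dialogue history, a reference reply, and a gold action label, and assuming BLEU for response quality and exact-match accuracy for actions. The field names, the `parse_reply_and_action` helper, and the metric choices are illustrative assumptions, not the benchmark's exact protocol.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def evaluate_open_loop(agent, dataset, parse_reply_and_action):
    """Score dialogue responses (BLEU) and physical action predictions (accuracy)
    on a held-out set, with no feedback into the simulator (open loop)."""
    smooth = SmoothingFunction().method1
    bleu_scores, correct_actions = [], 0

    # Assumed example fields: frames, route_hint, dialogue_history,
    # reference_reply, gold_action.
    for example in dataset:
        output = agent.respond(example["frames"], example["route_hint"],
                               example["dialogue_history"])
        # Hypothetical parser that splits the generated text into a reply and an action.
        reply, action = parse_reply_and_action(output)

        bleu_scores.append(sentence_bleu([example["reference_reply"].split()],
                                         reply.split(),
                                         smoothing_function=smooth))
        correct_actions += int(action == example["gold_action"])

    return {"bleu": sum(bleu_scores) / len(bleu_scores),
            "action_accuracy": correct_actions / len(dataset)}
```

Closed-loop evaluation differs in that the predicted action is executed in the simulator and the next observation depends on it, which is why the paper reports it separately from the open-loop scores.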
**Limitations and Future Work:**
DriVLMe faces limitations such as imbalanced training data, limited visual understanding, and unacceptable inference time. Future work should focus on enhancing world modeling, improving visual understanding, handling unexpected situations, and reducing inference time.
**Conclusion:**
DriVLMe demonstrates the potential of LLMs in autonomous driving dialogue tasks, but further research is needed to address its limitations and enhance its capabilities.