Embodied Understanding of Driving Scenarios


7 Mar 2024 | Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, and Hongyang Li
ELM is an embodied language model designed for understanding long-horizon driving scenarios in space and time. Unlike traditional vision-language models (VLMs), which are confined to the 2D domain and lack spatial awareness and long-term forecasting capabilities, ELM incorporates space-aware pre-training to enhance spatial localization and time-aware token selection to capture accurate temporal cues.

ELM is evaluated on a new benchmark of ten tasks spanning description, localization, memorization, and forecasting, and it outperforms previous state-of-the-art approaches across all of them. The model is trained on a diverse corpus drawn from nuScenes, Waymo, YouTube, and Ego4D, enabling it to handle complex driving scenarios; its performance on these tasks demonstrates its effectiveness in long-term event reasoning and forecasting.

Architecturally, ELM combines pre-training on open-world data with fine-tuning on the diverse downstream tasks. Its time-aware token selection module enables efficient retrieval of relevant information from long-term memory. In evaluation, ELM shows significant improvements over other models, including LLaMA-Adapter V2, LLaVA, Otter, and VideoChat. Its ability to handle long-horizon scenarios and its robust performance across tasks highlight its potential for autonomous driving applications.
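To make the retrieval idea concrete, below is a minimal sketch of what a time-aware token selection module could look like, assuming the long-term memory is a bank of per-frame visual tokens paired with timestamps and the query is a pooled text-instruction embedding. The class and variable names (`TimeAwareTokenSelector`, `memory`, `timestamps`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class TimeAwareTokenSelector(nn.Module):
    """Scores memory tokens against a query and keeps the top-k.

    A learned time embedding is added to each memory token before
    scoring, so relevance depends on *when* a frame was observed,
    not only on its visual content.
    """

    def __init__(self, dim: int, top_k: int = 32):
        super().__init__()
        self.top_k = top_k
        self.time_proj = nn.Linear(1, dim)    # embed scalar timestamps
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, memory: torch.Tensor,
                timestamps: torch.Tensor) -> torch.Tensor:
        # query:      (B, dim)     pooled text-instruction embedding
        # memory:     (B, T, dim)  token bank over T past frames
        # timestamps: (B, T)       observation time of each frame, seconds
        time_emb = self.time_proj(timestamps.unsqueeze(-1))  # (B, T, dim)
        keys = self.key_proj(memory + time_emb)              # time-aware keys
        q = self.query_proj(query).unsqueeze(1)              # (B, 1, dim)
        scores = (q * keys).sum(-1) / keys.shape[-1] ** 0.5  # (B, T)
        idx = scores.topk(self.top_k, dim=-1).indices        # (B, k)
        # Gather the k most relevant tokens to feed the language model.
        batch_idx = torch.arange(memory.size(0)).unsqueeze(-1)
        return memory[batch_idx, idx]                        # (B, k, dim)


# Usage: select 32 of 256 buffered frame tokens for a given question.
selector = TimeAwareTokenSelector(dim=768, top_k=32)
mem = torch.randn(2, 256, 768)
ts = torch.linspace(0, 25.5, 256).expand(2, -1)
q = torch.randn(2, 768)
print(selector(q, mem, ts).shape)  # torch.Size([2, 32, 768])
```

Note that hard top-k indexing is non-differentiable with respect to the scores, so a trainable version would need a soft or stochastic relaxation (e.g., Gumbel-softmax weighting); the hard variant above only illustrates the inference-time behavior of pruning long-term memory down to the tokens most relevant to the current query.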