Embodied Understanding of Driving Scenarios

7 Mar 2024 | Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li
The paper introduces the Embodied Language Model (ELM), a framework designed to enhance an agent's understanding of driving scenarios that span large extents of space and time. ELM addresses the limitations of traditional Vision-Language Models (VLMs) through two mechanisms: space-aware pre-training and time-aware token selection. Pre-training draws on diverse data sources, including autonomous driving datasets, YouTube videos, and the Ego4D dataset, to enable robust spatial localization and temporal reasoning. The time-aware token selection module efficiently retrieves information relevant to a given instruction from long-term memory.

The model is evaluated on a new benchmark of ten tasks assessing description, localization, memorization, and forecasting capabilities. ELM outperforms state-of-the-art VLMs on all tasks, demonstrating superior understanding of long-range four-dimensional (spatio-temporal) space. The paper also includes ablation studies, out-of-distribution evaluations, and zero-shot learning experiments that validate ELM's effectiveness and generalization.
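The summary describes time-aware token selection only at a high level. The sketch below illustrates one plausible reading of such a module: each remembered frame's visual tokens are pooled into a descriptor, scored against a pooled instruction embedding by cosine similarity, and the top-k best-matching frames are retrieved from memory. The function name, tensor shapes, and mean-pooling choice are illustrative assumptions, not ELM's actual implementation.

```python
# Hypothetical sketch of instruction-conditioned token retrieval from
# long-term memory; names and shapes are assumptions, not ELM's API.
import torch
import torch.nn.functional as F

def select_tokens(memory_tokens: torch.Tensor,   # (T, N, D): N tokens per past frame
                  instruction: torch.Tensor,     # (D,): pooled instruction embedding
                  k: int = 4) -> torch.Tensor:
    """Return the tokens of the k frames most relevant to the instruction."""
    # Pool each frame's tokens into a single descriptor, shape (T, D).
    frame_desc = memory_tokens.mean(dim=1)
    # Cosine similarity between every frame descriptor and the instruction, (T,).
    scores = F.cosine_similarity(frame_desc, instruction.unsqueeze(0), dim=-1)
    # Keep the k best-matching frames, restored to temporal order.
    topk = scores.topk(k).indices.sort().values
    return memory_tokens[topk]                   # (k, N, D)

# Toy usage: 32 remembered frames, 16 tokens each, 256-dim features.
tokens = torch.randn(32, 16, 256)
instr = torch.randn(256)
selected = select_tokens(tokens, instr, k=4)
print(selected.shape)  # torch.Size([4, 16, 256])
```

In this reading, retrieval cost grows linearly with memory length while the downstream language model only attends over the k selected frames, which is consistent with the paper's stated goal of efficient long-horizon reasoning.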