Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

27 Jun 2024 | Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu Jia
This paper introduces CODA-LM, the first benchmark for automatically evaluating Large Vision-Language Models (LVLMs) on self-driving corner cases. CODA-LM is a large-scale multimodal dataset for autonomous driving with a hierarchical evaluation framework, comprising 9,768 real-world driving scenarios annotated in detail for critical road entities and corner cases. The benchmark is structured into three tasks: general perception, regional perception, and driving suggestions.

To generate high-quality pre-annotations, a hierarchical data structure guides GPT-4V to analyze complex driving scenes and produce structured responses, which are then converted into coherent texts for human verification. The study shows that text-only large language models (LLMs) used as judges align better with human preferences than LVLM judges, so the evaluation framework adopts a text-only LLM judge, which exhibits superior consistency with human judgments.

The paper also proposes CODA-VLM, a new driving LVLM that achieves state-of-the-art performance on CODA-LM, surpassing GPT-4V by +21.42% on the regional perception task while striking a better balance between efficiency and performance. A comprehensive evaluation of existing LVLMs on self-driving corner cases demonstrates the effectiveness of CODA-LM in assessing LVLMs, and further experiments examine the impact of visual prompts and the contribution of each component of CODA-VLM.
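To make the judging protocol concrete, here is a minimal sketch of a text-only LLM judge in Python. It is an illustration under stated assumptions, not the paper's released code: the `call_llm` helper, the judge prompt wording, the 1-10 scale, and the reference field names (which mirror the three CODA-LM tasks) are all hypothetical.

```python
import re
from typing import Callable

# Illustrative reference annotation for one scene; the keys mirror the three
# CODA-LM tasks but are NOT the official release format.
reference = {
    "general_perception": "A delivery tricycle is parked in the right lane; "
                          "a pedestrian is crossing ahead.",
    "region_perception": "Region 1: an overturned traffic cone partially "
                         "blocking the ego lane.",
    "driving_suggestion": "Slow down, keep left, and yield to the pedestrian.",
}

JUDGE_TEMPLATE = """You are a strict evaluator of driving-scene descriptions.
Reference answer:
{reference}

Candidate answer:
{candidate}

Rate how well the candidate matches the reference on accuracy and completeness.
Reply with a single integer score from 1 (poor) to 10 (perfect)."""


def judge_response(task: str, candidate: str,
                   call_llm: Callable[[str], str]) -> int:
    """Score one candidate LVLM answer for one task with a text-only LLM judge.

    `call_llm` is a placeholder for any text-only LLM client (for example an
    OpenAI-compatible chat endpoint); it takes a prompt string and returns the
    judge's raw text reply.
    """
    prompt = JUDGE_TEMPLATE.format(reference=reference[task],
                                   candidate=candidate)
    reply = call_llm(prompt)
    match = re.search(r"\d+", reply)           # extract the first integer score
    return int(match.group()) if match else 0  # treat unparsable output as 0


# Usage sketch: average per-task scores over the benchmark to rank LVLMs.
# score = judge_response("driving_suggestion", model_answer, call_llm=my_client)
```

The key point this sketch reflects is that the judge sees only text, the human-verified reference and the model's answer, which is the setup the paper reports as most consistent with human preferences.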
The paper concludes that CODA-LM provides a valuable benchmark for evaluating LVLMs in self-driving scenarios and that the authors hope it will promote the development of reliable and interpretable autonomous driving systems. Its acknowledged limitations are that the dataset may not cover every unexpected driving condition and that controllable generation and automatic data calibration methods remain to be explored.