The paper introduces CODA-LM, a benchmark for automatically evaluating Large Vision-Language Models (LVLMs) on self-driving corner cases. Existing LVLM evaluations focus on natural circumstances and lack automated, quantifiable assessments for severe road corner cases. CODA-LM addresses this gap by using a hierarchical data structure to guide LVLMs in analyzing complex driving scenes and generating high-quality pre-annotations for human annotators. The paper demonstrates that text-only large language models (LLMs) can serve as effective judges for evaluating LVLMs, showing stronger consistency with human preferences than LVLM judges. Additionally, the authors propose CODA-VLM, a new driving LVLM that outperforms all open-source counterparts on CODA-LM and even surpasses GPT-4V by 21.42% on the regional perception task. The main contributions are the CODA-LM benchmark itself, the finding that text-only LLMs are effective judges, and the state-of-the-art performance of CODA-VLM. The paper also details the construction of the CODA-LM dataset, describes the evaluation framework, and reports ablation studies that validate the proposed methods.