3 Jun 2024 | Yuanxin Liu*, Shicheng Li*, Yi Liu*, Yuxiang Wang*, Shuhuai Ren*, Lei Li†, Sishuo Chen‡, Xu Sun*, Lu Hou‡
The paper "TempCompass: Do Video LLMs Really Understand Videos?" addresses the lack of comprehensive evaluation methods for assessing the temporal perception ability of Video Large Language Models (Video LLMs). Existing benchmarks fail to provide a nuanced understanding of how Video LLMs perform on different temporal aspects and task formats. To address this, the authors propose TempCompass, a benchmark that introduces a diverse range of temporal aspects (Action, Speed, Direction, Attribute Change, and Event Order) and task formats (Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation). They collect high-quality test data by constructing conflicting videos and using a combination of human annotation and LLM-generated instructions. The evaluation method combines rule-based and LLM-based approaches to assess the responses from Video LLMs. The results, based on TempCompass, reveal that state-of-the-art Video LLMs exhibit poor temporal perception abilities, often performing worse than Image LLMs. The paper also highlights the need for diverse task formats in the assessment process to better evaluate the temporal perception capabilities of Video LLMs.The paper "TempCompass: Do Video LLMs Really Understand Videos?" addresses the lack of comprehensive evaluation methods for assessing the temporal perception ability of Video Large Language Models (Video LLMs). Existing benchmarks fail to provide a nuanced understanding of how Video LLMs perform on different temporal aspects and task formats. To address this, the authors propose TempCompass, a benchmark that introduces a diverse range of temporal aspects (Action, Speed, Direction, Attribute Change, and Event Order) and task formats (Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation). They collect high-quality test data by constructing conflicting videos and using a combination of human annotation and LLM-generated instructions. The evaluation method combines rule-based and LLM-based approaches to assess the responses from Video LLMs. The results, based on TempCompass, reveal that state-of-the-art Video LLMs exhibit poor temporal perception abilities, often performing worse than Image LLMs. The paper also highlights the need for diverse task formats in the assessment process to better evaluate the temporal perception capabilities of Video LLMs.