3 Jun 2024 | Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
TempCompass is a benchmark designed to evaluate the temporal perception ability of Video Large Language Models (Video LLMs). It introduces diverse temporal aspects and task formats to comprehensively assess how well Video LLMs understand video content over time. TempCompass covers five basic temporal aspects (Action, Speed, Direction, Attribute Change, Event Order), ten fine-grained sub-aspects, and four task formats (Multi-Choice QA, Yes/No QA, Caption Matching, Caption Generation). To prevent models from exploiting single-frame bias or language priors, the benchmark constructs "conflicting" videos that share the same static content but differ in a specific temporal aspect. Task instructions are collected through a combination of human annotation and LLM generation, and an LLM-based evaluation method is proposed to automatically assess Video LLM responses.

TempCompass evaluates 8 state-of-the-art Video LLMs and 3 Image LLMs. The results reveal that current Video LLMs exhibit notably poor temporal perception ability, showing little advantage over Image LLMs. The results also show that temporal perception ability varies considerably across task formats, underscoring the need for diverse task formats in evaluation. The data and evaluation code are available at https://github.com/llyx97/TempCompass.
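To make the benchmark's structure concrete, the following Python sketch enumerates the aspect-by-format grid described above. The `TaskInstance` record and its field names are hypothetical illustrations, not the actual data layout used in the TempCompass repository.

```python
# Illustrative sketch of the TempCompass task space (hypothetical field
# names; the official data files in the repo may use different keys).

from dataclasses import dataclass

# Five basic temporal aspects evaluated by the benchmark.
TEMPORAL_ASPECTS = ["action", "speed", "direction", "attribute_change", "event_order"]

# Four task formats applied to each aspect.
TASK_FORMATS = ["multi_choice_qa", "yes_no_qa", "caption_matching", "caption_generation"]

@dataclass
class TaskInstance:
    """One benchmark item: a video paired with a temporal instruction."""
    video_id: str     # the original or a constructed "conflicting" video
    aspect: str       # one of TEMPORAL_ASPECTS
    task_format: str  # one of TASK_FORMATS
    question: str     # instruction shown to the Video LLM
    answer: str       # ground truth used by the automatic evaluator

def enumerate_cells():
    """Yield every (aspect, format) cell of the evaluation grid."""
    for aspect in TEMPORAL_ASPECTS:
        for task_format in TASK_FORMATS:
            yield aspect, task_format
```

Pairing each aspect with each task format is what lets the benchmark separate what a model perceives from how the question is posed, which is the basis for the finding that performance varies across formats.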
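The LLM-based evaluation can be pictured as an LLM-as-judge step: the question, the ground-truth answer, and the Video LLM's response are packed into a prompt, and a judge LLM returns a verdict. This is a minimal sketch under that assumption; the prompt wording and the `ask_llm` helper are illustrative, not the paper's exact evaluation prompts.

```python
# Minimal LLM-as-judge sketch, in the spirit of the paper's automatic
# evaluation. The prompt template and `ask_llm` callable are assumptions
# for illustration only.

from typing import Callable

JUDGE_TEMPLATE = (
    "You are evaluating a Video LLM's answer.\n"
    "Question: {question}\n"
    "Ground-truth answer: {answer}\n"
    "Model response: {response}\n"
    "Reply with exactly one word: Correct or Incorrect."
)

def judge_response(
    question: str,
    answer: str,
    response: str,
    ask_llm: Callable[[str], str],  # e.g. a wrapper around a chat-completion API
) -> bool:
    """Return True if the judge LLM deems the response correct."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, answer=answer, response=response
    )
    verdict = ask_llm(prompt).strip().lower()
    return verdict.startswith("correct")
```

In practice, `ask_llm` would wrap a call to a strong judge model, and free-form outputs such as generated captions would need additional format-specific handling beyond this one-word verdict.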