28 May 2025 | Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang
MTVQA is a new benchmark for multilingual text-centric visual question answering (TEC-VQA), featuring high-quality human expert annotations in nine languages: Arabic, Korean, Japanese, Thai, Vietnamese, Russian, French, German, and Italian. It contains 6,778 question-answer pairs across 2,116 images, covering more than 20 fine-grained scenarios drawn from documents and natural scenes. The data was meticulously collected and annotated by human experts to ensure visual-textual alignment, using a two-round annotation process of question generation followed by evaluation. The benchmark measures the performance of a range of multimodal large language models (MLLMs), including Qwen2.5-VL, InternVL-2.5, GPT-4o, GPT-4V, Claude3, and Gemini. Results show that even the top-performing MLLM, InternVL-2.5, scores significantly below human performance, highlighting the need for further progress in multilingual text-centric visual question answering. With its nuanced multilingual annotations, the dataset aims to set a new standard for benchmarks and to foster advances in multilingual visual text comprehension. MTVQA is released at https://github.com/bytedance/MTVQA.
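To make the evaluation setup concrete, the sketch below scores model predictions per language with a lenient containment metric, in the spirit of text-centric VQA benchmarks. It is a minimal sketch only: the record field names (`language`, `question`, `answer`) and the scoring rule (normalized gold answer contained in the prediction) are illustrative assumptions, not the paper's documented schema or official metric.

```python
# Minimal sketch of per-language scoring for an MTVQA-style benchmark.
# Field names ("language", "question", "answer") and the containment-based
# scoring rule are assumptions for illustration, not the official protocol.
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation for a lenient match."""
    return text.lower().strip().strip(".,!?")

def score_predictions(examples, predictions):
    """Per-language accuracy: a prediction counts as correct when the
    normalized gold answer appears inside the normalized model output."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        lang = ex["language"]  # hypothetical field name
        totals[lang] += 1
        if normalize(ex["answer"]) in normalize(pred):
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Toy usage with in-memory records standing in for real MTVQA examples:
examples = [
    {"language": "fr", "question": "Quel est le titre ?", "answer": "Le Monde"},
    {"language": "de", "question": "Was steht auf dem Schild?", "answer": "Ausfahrt"},
]
predictions = ["The masthead reads Le Monde.", "The sign says Einfahrt."]
print(score_predictions(examples, predictions))  # {'fr': 1.0, 'de': 0.0}
```

A containment match is deliberately forgiving toward verbose model outputs; stricter alternatives such as exact match or ANLS would penalize paraphrased answers more heavily, and per-language aggregation mirrors how the benchmark reports results across its nine languages.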