28 May 2025 | Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang
MTVQA is a new benchmark for multilingual text-centric visual question answering (TEC-VQA), featuring high-quality human expert annotations in nine languages: Arabic, Korean, Japanese, Thai, Vietnamese, Russian, French, German, and Italian. It contains 6,778 question-answer pairs across 2,116 images, covering more than 20 fine-grained scenarios drawn from documents and natural scenes. The data was meticulously collected and annotated by human experts to ensure visual-textual alignment, using a two-round annotation process of question generation followed by evaluation. The benchmark measures the performance of a range of multimodal large language models (MLLMs), including Qwen2.5-VL, InternVL-2.5, GPT-4o, GPT-4V, Claude3, and Gemini. Results show that even the top-performing MLLM, InternVL-2.5, scores significantly below human performance, highlighting the need for further progress in multilingual text-centric visual question answering. With its nuanced multilingual annotations, the dataset aims to set a new standard for benchmarks and to foster advances in multilingual visual text comprehension. MTVQA is released at https://github.com/bytedance/MTVQA.
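To make the evaluation setup concrete, the sketch below scores model predictions per language with a lenient containment metric, in the spirit of text-centric VQA benchmarks. It is a minimal sketch only: the record field names (`language`, `question`, `answer`) and the scoring rule (normalized gold answer contained in the prediction) are illustrative assumptions, not the paper's documented schema or official metric.

```python
# Minimal sketch of per-language scoring for an MTVQA-style benchmark.
# Field names ("language", "question", "answer") and the containment-based
# scoring rule are assumptions for illustration, not the official protocol.
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation for a lenient match."""
    return text.lower().strip().strip(".,!?")

def score_predictions(examples, predictions):
    """Per-language accuracy: a prediction counts as correct when the
    normalized gold answer appears inside the normalized model output."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        lang = ex["language"]  # hypothetical field name
        totals[lang] += 1
        if normalize(ex["answer"]) in normalize(pred):
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Toy usage with in-memory records standing in for real MTVQA examples:
examples = [
    {"language": "fr", "question": "Quel est le titre ?", "answer": "Le Monde"},
    {"language": "de", "question": "Was steht auf dem Schild?", "answer": "Ausfahrt"},
]
predictions = ["The masthead reads Le Monde.", "The sign says Einfahrt."]
print(score_predictions(examples, predictions))  # {'fr': 1.0, 'de': 0.0}
```

A containment match is deliberately forgiving toward verbose model outputs; stricter alternatives such as exact match or ANLS would penalize paraphrased answers more heavily, and per-language aggregation mirrors how the benchmark reports results across its nine languages.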