TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

30 Apr 2024 | Yoonsik Kim, Moonbin Yim, Ka Yeon Song
TableVQA-Bench is a new benchmark for visual question answering (VQA) on tables, constructed from existing table question-answering (QA) and table structure recognition (TSR) datasets. It contains 1,500 QA pairs, with table images generated either through a stylesheet or a proposed table rendering system. The QA pairs are generated by a large language model (LLM) that takes text-formatted tables as input. The benchmark spans four domains: VWTQ, VWTQ-Syn, VTabFact, and FinTabNetQA.

The benchmark was evaluated with various multi-modal large language models (MLLMs), among which GPT-4V achieved the highest accuracy. The study found that the number of vision queries significantly affects performance, and that models given text-formatted tables generally outperform those given vision-formatted (image) tables. A detailed analysis across table formats further shows that MLLM accuracy is significantly affected by the aspect ratio of the input image. The benchmark is available at https://github.com/naver-ai/tablevqabench.
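As a rough illustration of the evaluation setup described above, the sketch below shows how one might score a model on TableVQA-Bench-style QA pairs. The file layout (a `qa_pairs.jsonl` with `image_path`, `question`, and `answer` fields) and the `answer_question` callable are hypothetical placeholders, not the benchmark's actual data format or API; the repository linked above defines the authoritative format and evaluation scripts.

```python
import json
from pathlib import Path
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact-match comparison."""
    return " ".join(text.lower().split())

def evaluate(qa_file: Path, answer_question: Callable[[Path, str], str]) -> float:
    """Compute exact-match accuracy over a JSONL file of QA pairs.

    Each line is assumed (hypothetically) to look like:
    {"image_path": "tables/0001.png", "question": "...", "answer": "..."}
    """
    correct, total = 0, 0
    with qa_file.open() as f:
        for line in f:
            record = json.loads(line)
            prediction = answer_question(Path(record["image_path"]), record["question"])
            correct += normalize(prediction) == normalize(record["answer"])
            total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    # Stand-in model that always answers "unknown"; replace with a real MLLM call
    # that takes the rendered table image and the question.
    dummy_model = lambda image_path, question: "unknown"
    print(f"Accuracy: {evaluate(Path('qa_pairs.jsonl'), dummy_model):.3f}")
```

In practice the benchmark also distinguishes text-formatted from vision-formatted table inputs, so a fuller harness would run the same QA pairs through both input modes and compare accuracies per domain.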