12 Jun 2024 | Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, Weiping Wang
This paper introduces the concept of multimodal table understanding, a novel problem where models must generate responses to various table-related requests based on table images. The authors address the challenge of accessing high-quality textual table representations in real-world scenarios and propose a large-scale dataset named MMTab, which covers a wide range of table images, instructions, and tasks. They develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms existing MLLM baselines on multiple benchmarks. The dataset and model are designed to facilitate the advancement of table understanding and its practical applications. The paper also includes a comprehensive evaluation of Table-LLaVA's performance and discusses its limitations and ethical considerations.