12 Jun 2024 | Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, Weiping Wang
This paper introduces a new problem in multimodal table understanding, where a model must generate correct responses to various table-related requests based on a given table image. To facilitate model training and evaluation, the authors construct a large-scale dataset named MMTab, which covers a wide range of table images, instructions, and tasks. Based on this dataset, they develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under both held-in and held-out settings. The code and data are available at https://github.com/SpursGoZmy/Table-LLaVA.
The paper discusses the challenges of table understanding, particularly the difficulty of obtaining high-quality textual table representations in real-world scenarios, where table images are often far more accessible. It highlights the importance of understanding tables directly from intuitive visual information. The authors propose the multimodal table understanding problem, in which the model must generate correct responses to different table-related requests in an end-to-end fashion based on the table image. Although recent multimodal large language models (MLLMs) have demonstrated excellent capabilities on many multimodal tasks, they cannot be directly applied to the proposed task. The authors construct MMTab, the first open-source large-scale dataset for multimodal table understanding, based on 14 publicly available table datasets spanning 8 domains. They carefully design scripts to convert the original textual tables into table images with broad coverage of table structures and styles, and transform all task-specific samples into multimodal instruction-tuning samples with a unified format of <table image, input request, output response>. The resulting dataset contains 150K table recognition samples on 97K table images for pre-training, 232K samples of 14 table-based tasks on 82K table images for instruction tuning, and 49K test samples on 23K table images comprising 17 held-in and 7 held-out benchmarks.
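The conversion pipeline described above is conceptually simple. The sketch below illustrates one way to render a textual table as an image and pack a task-specific sample into the unified <table image, input request, output response> format; the rendering style, function names, and example content are illustrative assumptions, not the authors' actual scripts (which cover far more varied table structures and styles).

```python
# Illustrative sketch only: render a textual table as an image and wrap a
# sample into the unified multimodal instruction format used by MMTab.
import json
import matplotlib.pyplot as plt


def render_table_image(header, rows, out_path):
    """Render a textual table as a PNG image (one possible visual style)."""
    fig, ax = plt.subplots(figsize=(4, 1 + 0.4 * len(rows)))
    ax.axis("off")
    tbl = ax.table(cellText=rows, colLabels=header, loc="center", cellLoc="center")
    tbl.scale(1, 1.4)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


def to_instruction_sample(image_path, request, response):
    """Pack one sample as <table image, input request, output response>."""
    return {
        "table_image": image_path,
        "input_request": request,
        "output_response": response,
    }


if __name__ == "__main__":
    header = ["Player", "Team", "Points"]
    rows = [["A. Smith", "Lions", "27"], ["B. Jones", "Hawks", "19"]]
    render_table_image(header, rows, "example_table.png")
    sample = to_instruction_sample(
        "example_table.png",
        "Based on the table image, which player scored more points?",
        "A. Smith scored more points (27 vs. 19).",
    )
    print(json.dumps(sample, indent=2))
```

Running this on the two-row example produces a small PNG plus a JSON record in the unified format, mirroring the kind of triples that make up MMTab.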
Based on the curated dataset, the authors develop a versatile tabular MLLM named Table-LLaVA with an enhanced two-stage training paradigm. In the first stage, they pre-train LLaVA-1.5 with an extra table recognition task on MMTab-pre, which requires the model to generate textual table sequences from table images. In the second stage, they continue to instruction-tune the model on MMTab-instruct with diverse table-based downstream tasks. The experimental results show that Table-LLaVA outperforms strong MLLM baselines on 17 held-in and 6 held-out benchmarks, and is even competitive with the powerful GPT-4V on 14 benchmarks under a subset of test samples. The authors also explore the mutual influence between the model's capacity for tabular and non-tabular tasks. They conclude that their contributions include the first systematic exploration of the multimodal table understanding problem, together with the construction of the MMTab dataset and the development of the Table-LLaVA model.
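For concreteness, the following sketch shows how the two training stages described above could consume MMTab data: stage 1 builds table recognition samples (image in, textual table out) from MMTab-pre, and stage 2 feeds the downstream-task samples from MMTab-instruct. The prompt wording and the `train_stage` placeholder are assumptions for illustration; the actual model builds on LLaVA-1.5's training code, which is not reproduced here.

```python
# Hedged sketch of the two-stage data flow; `train_stage` is a placeholder
# standing in for the real LLaVA-1.5 training procedure.
from typing import Callable, Dict, List

Sample = Dict[str, str]  # keys: "table_image", "input_request", "output_response"


def make_recognition_sample(image_path: str, markdown_table: str) -> Sample:
    """Stage-1 (MMTab-pre) sample: ask the model to transcribe the table image."""
    return {
        "table_image": image_path,
        "input_request": "Recognize the table in the image and write it out in Markdown.",
        "output_response": markdown_table,
    }


def two_stage_training(
    pretrain_samples: List[Sample],
    instruct_samples: List[Sample],
    train_stage: Callable[[List[Sample], str], None],
) -> None:
    """Recognition pre-training first, then table-task instruction tuning."""
    # Stage 1: continue pre-training with the extra table recognition task so
    # the model learns to align table images with their textual content.
    train_stage(pretrain_samples, "stage1_table_recognition_pretraining")
    # Stage 2: instruction-tune on the diverse table-based downstream tasks.
    train_stage(instruct_samples, "stage2_instruction_tuning")
```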