11 Oct 2024 | Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang
TabPedia is a large vision-language model designed for comprehensive visual table understanding (VTU). Its core contribution is a concept synergy mechanism that abstracts diverse VTU tasks and multi-source visual embeddings into concepts, allowing table detection, table structure recognition, table querying, and table question answering to be handled seamlessly within a single framework. A large language model (LLM) consumes these concepts and generates responses grounded in the visual table content.

Architecturally, TabPedia uses dual vision encoders: a Swin-B encoder extracts high-resolution features and a ViT-L encoder extracts low-resolution features. These visual embeddings are combined with instruction-derived tokens and fed to the LLM, and the concept synergy mechanism lets each task draw on clues from its corresponding source embeddings, so the tasks work in harmony rather than interfering with one another.

The paper also introduces ComTQA, a new benchmark for evaluating VTU tasks, featuring approximately 9,000 QA pairs. Extensive experiments across table detection, structure recognition, querying, and question answering show that TabPedia outperforms existing methods on both perception and comprehension tasks, and the authors highlight its ability to handle complex tables in real-world scenarios.

Limitations include an inability to accurately parse twisted or distorted tables and a reliance on table-dominated datasets for TQA. Despite these limitations, TabPedia demonstrates strong visual table understanding capabilities and represents a significant advance in the field.
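To make the input-assembly step concrete, here is a minimal PyTorch sketch of how features from two encoders might be projected into an LLM's embedding space and concatenated with instruction-derived tokens. The dimensions, projection layers, and module names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Sketch: fuse high-res (Swin-B-like) and low-res (ViT-L-like) visual
    features with instruction-derived tokens into one sequence for an LLM.
    All dimensions and layer choices here are hypothetical."""
    def __init__(self, hi_dim=1024, lo_dim=1024, llm_dim=4096):
        super().__init__()
        # Assumed linear projections mapping each encoder's output
        # into the LLM's embedding space.
        self.proj_hi = nn.Linear(hi_dim, llm_dim)
        self.proj_lo = nn.Linear(lo_dim, llm_dim)

    def forward(self, hi_feats, lo_feats, instr_tokens):
        # hi_feats:     (B, N_hi, hi_dim)   fine-grained, high-resolution features
        # lo_feats:     (B, N_lo, lo_dim)   global, low-resolution features
        # instr_tokens: (B, N_txt, llm_dim) embedded task instruction
        vis_hi = self.proj_hi(hi_feats)
        vis_lo = self.proj_lo(lo_feats)
        # Concatenate the multi-source visual embeddings with the instruction
        # tokens; the LLM then attends over the full sequence to respond.
        return torch.cat([vis_lo, vis_hi, instr_tokens], dim=1)

# Usage with dummy tensors standing in for encoder outputs:
fusion = DualEncoderFusion()
hi = torch.randn(1, 196, 1024)   # e.g. high-resolution patch features
lo = torch.randn(1, 64, 1024)    # e.g. low-resolution patch features
txt = torch.randn(1, 16, 4096)   # embedded instruction tokens
seq = fusion(hi, lo, txt)
print(seq.shape)  # torch.Size([1, 276, 4096])
```

Simple concatenation is just one plausible fusion strategy; the key idea the paper emphasizes is that the LLM can attend across both resolutions and the instruction jointly, which is what the concept synergy mechanism exploits.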