InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

24 Jan 2024 | Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
InstructDoc is a large-scale dataset for zero-shot generalization of visual document understanding (VDU) with instructions. It unifies 30 publicly available VDU datasets, covering 12 diverse tasks and open document types, and provides expert-annotated instructions in a unified format so that models can learn document layouts, visual representations, and relationships among objects. The paper also introduces InstructDr, a new instruction-based document reading and understanding model that bridges document images, image encoders, and large language models (LLMs) through a trainable module called Document-former. Document-former converts documents into features that are useful to the LLM, and InstructDr achieves the highest zero-shot performance among existing multimodal LLMs, outperforming ChatGPT on a wide range of VDU datasets with instructions.
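As a concrete illustration of this bridging design, below is a minimal PyTorch sketch of a Q-Former-style module: learnable query tokens cross-attend to frozen image-encoder features and are projected into the LLM's embedding space as soft prompts. The class name, dimensions, and layer counts here are illustrative assumptions, not the paper's actual Document-former implementation, which differs in detail.

```python
# Minimal sketch of a Q-Former-style bridging module (illustrative only;
# names, dimensions, and layer counts are assumptions, and the paper's
# actual Document-former differs in detail).
import torch
import torch.nn as nn

class DocumentFormerSketch(nn.Module):
    """Learnable queries cross-attend to frozen image-encoder features and
    are projected into the LLM's embedding space as soft prompt tokens."""

    def __init__(self, num_queries=32, d_model=768, llm_dim=4096, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(d_model, llm_dim)  # map into the LLM's space

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, d_model) from a frozen encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        x = self.decoder(tgt=q, memory=image_feats)  # cross-attend to patches
        return self.proj(x)  # (batch, num_queries, llm_dim) soft prompts

# The projected queries would be prepended to the LLM's input embeddings,
# alongside the embedded instruction text.
bridge = DocumentFormerSketch()
feats = torch.randn(2, 196, 768)  # e.g., ViT patch features for 2 pages
prompts = bridge(feats)           # -> torch.Size([2, 32, 4096])
```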
The paper also discusses the challenges of visual instruction tuning: previous datasets focus on understanding visual objects in natural scene images, so existing models struggle with tasks that require reading and understanding documents. InstructDoc addresses this gap by providing a diverse set of VDU tasks over open document types, enabling models to learn rich representations of document structure through instructions. The covered tasks include key information extraction, single-page and multi-page QA with discrete and visual reasoning, document natural language inference, dialogue, captioning, classification, document layout analysis, and image-text matching. For evaluation, the constituent datasets are split into held-in and held-out sets, with the held-out datasets carefully selected to avoid data contamination. Compared with other VDU instruction-tuning datasets, InstructDoc stands out for its coverage of open document types, its wider range of tasks, and its more extensive instruction sets.

In evaluations across these VDU tasks, InstructDr outperforms existing models, demonstrating its effectiveness in zero-shot settings. The paper further analyzes the role of instructions in improving performance, the model's robustness to diverse instruction phrasings, and the impact of task clusters on performance. Under task-specific fine-tuning, InstructDr likewise shows superior performance on various VDU tasks. The paper concludes that InstructDoc is a valuable resource for developing general-purpose document AI systems and that InstructDr is an effective model for instruction-based document understanding.
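As a closing illustration, a record in the unified instruction format described above might look like the following. This is a hypothetical sketch: every field name, path, and value is an assumption made for illustration, not InstructDoc's actual schema.

```python
# Hypothetical unified-format record. All field names, paths, and values
# are illustrative assumptions, not InstructDoc's actual schema.
record = {
    "source_dataset": "DocVQA",          # one of the 30 source datasets
    "task_cluster": "single-page QA",    # one of the 12 task clusters
    "instruction": ("Answer the question using the text and layout of "
                    "the given document image."),
    "image_path": "images/example_page.png",  # hypothetical path
    "input": "Question: What is the invoice date?",
    "output": "March 3, 1998",
}
```

Casting every dataset into such instruction-input-output triples is what allows a single model to be trained across heterogeneous tasks and then prompted zero-shot on held-out datasets.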