24 Jan 2024 | Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
The paper introduces InstructDoc, a large-scale dataset for zero-shot generalization on visual document understanding (VDU) tasks via human-written instructions. InstructDoc unifies 30 publicly available VDU datasets, each paired with diverse instructions in a common format, covering 12 tasks and a wide range of document types. To improve generalization, the authors propose InstructDr, a model that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module called Document-former. Experiments show that InstructDr adapts to new VDU datasets, tasks, and domains from instructions alone, outperforming existing multimodal LLMs and ChatGPT without task-specific training. The paper also provides a detailed introduction to VDU tasks, related work, dataset collection, and experimental results, highlighting the effectiveness of InstructDoc and InstructDr in improving zero-shot performance.
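To make the bridging idea concrete, here is a minimal, hypothetical sketch of what a Document-former-style module does at a high level: it projects frozen image-encoder features into the LLM's embedding space so that visual tokens and instruction tokens can be fed to the LLM together. All class names, dimensions, and the simple linear projection below are illustrative assumptions, not the paper's actual architecture or code.

```python
# Hypothetical sketch of a bridging module between an image encoder and an LLM.
# A real Document-former is a trainable transformer; this toy version uses a
# single linear projection to keep the idea self-contained.
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (n x d_in) @ (d_in x d_out) -> (n x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


class BridgingModule:
    """Trainable projection from image-feature space into LLM embedding space.

    The weights here are initialized to a fixed toy value; in practice they
    would be learned during instruction tuning while the encoder stays frozen.
    """

    def __init__(self, d_image: int, d_llm: int) -> None:
        self.w = [[0.1 if i == j else 0.0 for j in range(d_llm)]
                  for i in range(d_image)]

    def __call__(self, image_feats: Matrix) -> Matrix:
        return matmul(image_feats, self.w)


def build_llm_inputs(image_feats: Matrix,
                     instruction_embeds: Matrix,
                     bridge: BridgingModule) -> Matrix:
    """Concatenate projected visual tokens with instruction token embeddings,
    yielding one sequence the LLM can attend over."""
    return bridge(image_feats) + instruction_embeds
```

Under this sketch, zero-shot adaptation comes from the instruction embeddings: swapping in a new instruction changes the LLM's input sequence without retraining the bridge.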