LayoutLLM is a method for document analysis that combines large language models (LLMs) with visually rich document understanding (VrDU). It uses a pre-trained VrDU model as an encoder for document images and an LLM as a decoder that interprets task instructions and the document's textual content; concretely, the encoder is a pre-trained LayoutLMv3 model and the decoder is a pre-trained Llama model. Because tasks are expressed as natural-language instructions, a single model can be fine-tuned jointly across multiple VrDU and NLP tasks rather than requiring a separate model per task.

The model was fine-tuned on a dataset of 52K instructions and their responses, covering VrDU tasks (document image classification, information extraction, and document visual question answering) as well as NLP tasks. In evaluation, LayoutLLM achieved high accuracy across the VrDU benchmarks, outperforming professionally tuned task-specific models on several of them, and it also improved performance on NLP language-comprehension tasks. The study concludes that LayoutLLM is a flexible framework for multi-domain NLP and VrDU tasks, combining the strengths of VrDU models and LLMs.
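The encoder-decoder pairing can be sketched in a few lines of PyTorch. This is a minimal sketch, not the paper's released implementation: the Hugging Face classes `LayoutLMv3Model` and `LlamaForCausalLM` are real, but the checkpoint names, the linear projector, and the soft-prefix conditioning are assumptions made for illustration; the paper's actual bridging mechanism may differ.

```python
import torch
import torch.nn as nn
from transformers import LayoutLMv3Model, LlamaForCausalLM


class LayoutLLMSketch(nn.Module):
    """Minimal sketch: LayoutLMv3 encoder bridged to a Llama decoder.

    The linear projector and soft-prefix conditioning are illustrative
    assumptions, not the paper's confirmed design.
    """

    def __init__(self,
                 encoder_name="microsoft/layoutlmv3-base",
                 decoder_name="meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.encoder = LayoutLMv3Model.from_pretrained(encoder_name)
        self.decoder = LlamaForCausalLM.from_pretrained(decoder_name)
        # Map encoder hidden states into the decoder's embedding space
        # (768 -> 4096 for the checkpoints named above).
        self.projector = nn.Linear(self.encoder.config.hidden_size,
                                   self.decoder.config.hidden_size)

    def forward(self, doc_inputs, instruction_ids, labels=None):
        # Encode the document: image patches plus OCR tokens and boxes.
        doc_feats = self.encoder(**doc_inputs).last_hidden_state
        doc_embeds = self.projector(doc_feats)

        # Embed the instruction with the decoder's own token embeddings
        # and prepend the projected document features as a soft prefix.
        text_embeds = self.decoder.get_input_embeddings()(instruction_ids)
        inputs_embeds = torch.cat([doc_embeds, text_embeds], dim=1)

        if labels is not None:
            # Don't compute loss over the document-prefix positions.
            prefix_ignore = torch.full(doc_embeds.shape[:2], -100,
                                       dtype=labels.dtype,
                                       device=labels.device)
            labels = torch.cat([prefix_ignore, labels], dim=1)

        return self.decoder(inputs_embeds=inputs_embeds, labels=labels)
```

Only the decoder sees the loss over the instruction's answer tokens; the projected document features act purely as conditioning, which is one common way to graft a frozen or lightly tuned encoder onto an LLM.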
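As for the instruction-tuning data, one way to picture the 52K instruction-response pairs is as records tying together a document, a task instruction, and a target answer. The field names below are hypothetical, chosen only to illustrate the idea; the dataset's actual schema is not specified in this summary.

```python
# Hypothetical instruction-tuning record (field names are illustrative,
# not the dataset's actual schema):
example = {
    "image": "invoice_0042.png",                     # document image file
    "instruction": "What is the total amount due?",  # natural-language task
    "response": "$1,284.50",                         # target output
    "task": "document_vqa",                          # or classification /
}                                                    # information extraction
```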