LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding


8 Apr 2024 | Chuwei Luo¹*, Yufan Shen¹²*, Zhaoqing Zhu¹*, Qi Zheng¹, Zhi Yu², Cong Yao¹
¹Alibaba Group  ²Zhejiang University
{luochuwei, zzhaoqing.z, zhengqisjtu, yaocong2010}@gmail.com, {syficy, yuzhihrenzhe}@zju.edu.cn

**Abstract**

Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has shown promising results. However, previous works have not fully explored or exploited document layout information, which is crucial for precise document understanding. This paper proposes LayoutLLM, an LLM/MLLM-based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy designed to enhance the comprehension and utilization of document layouts. The strategy consists of two components: layout-aware pre-training and layout-aware supervised fine-tuning. To capture document layout characteristics during pre-training, three groups of tasks are introduced, at the document, region, and segment levels. In addition, a novel module called LayoutCoT enables LayoutLLM to focus on the regions relevant to a question and generate accurate answers. LayoutCoT improves performance and provides a degree of interpretability, facilitating manual inspection and correction. Experiments on standard benchmarks demonstrate that LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.

**Introduction**

Document AI, which includes tasks such as document VQA and visual information extraction, is an active topic in both academia and industry. While pre-trained document models achieve excellent performance, adapting them to zero-shot document understanding remains challenging because they require fine-tuning on downstream task data. LLMs and MLLMs have shown remarkable zero-shot capabilities across many applications, but their use in document understanding has been limited by the lack of an effective representation of layout information.

**LayoutLLM**

LayoutLLM integrates a document pre-trained model as its encoder and employs a layout instruction tuning strategy consisting of layout-aware pre-training and layout-aware supervised fine-tuning. The model architecture comprises a document pre-trained encoder, multimodal projectors, and a large language model. The layout instruction tuning strategy enhances the model's ability to understand and utilize document layouts, improving zero-shot document understanding performance.
**Experiments**

Extensive experiments on document understanding benchmarks show that LayoutLLM significantly outperforms existing methods based on open-source LLMs/MLLMs. Ablation studies and qualitative results further validate the effectiveness of the proposed layout-aware pre-training and fine-tuning strategies. The interactive correction capability of LayoutCoT is also demonstrated, highlighting its potential in high-stakes scenarios.

**Conclusion**

LayoutLLM effectively leverages layout information for document understanding, significantly improving zero-shot performance. Future work will