HRVDA: High-Resolution Visual Document Assistant

10 Apr 2024 | Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Linli Xu
**Abstract:** This paper introduces HRVDA (High-Resolution Visual Document Assistant), a novel multimodal large language model (MLLM) designed to enhance visual document understanding. The model addresses the limitations of current MLLMs in handling high-resolution images and document-oriented instructions. HRVDA employs a content filtering mechanism and an instruction filtering module to process high-resolution images efficiently, pruning content-agnostic and instruction-agnostic visual tokens. In addition, the authors construct a document-oriented visual instruction tuning dataset and apply a multi-stage training strategy to improve the model's document modeling capabilities. Extensive experiments demonstrate that HRVDA achieves state-of-the-art performance on multiple document understanding datasets while maintaining efficient training and inference speeds.

**Introduction:** Large language models (LLMs) have made significant progress toward general artificial intelligence, but their performance on visual document understanding remains limited by low-resolution image inputs and the lack of document-oriented visual instruction tuning. HRVDA aims to bridge this gap by directly processing high-resolution images and strengthening document understanding capabilities.

**Related Work:** The paper reviews existing methods in visual document understanding, both OCR-dependent and OCR-free, and discusses multimodal large language models and token pruning techniques.

**HRVDA:**
- **Model Architecture:** HRVDA consists of a content detector, an image encoder, an instruction filtering module, and an LLM.
- **Content Filtering:** The content detector identifies tokens containing significant content, and a content filtering mechanism prunes the content-agnostic tokens (see the first sketch after this list).
- **Instruction Filtering:** An instruction filtering module further removes instruction-agnostic tokens (see the second sketch after this list).
- **Visual Instruction Tuning:** A diverse dataset of document tasks is constructed for tuning, with ChatGPT used to generate varied instruction templates.
- **Training Strategy:** A multi-stage training approach fine-tunes the model, covering content detection, image encoding, and instruction filtering in turn.
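Concretely, content filtering amounts to score-based token pruning. The sketch below is a minimal, hypothetical PyTorch rendering of that step, not the paper's implementation: `content_scores` stands in for the content detector's output, and `keep_ratio` and the top-k selection policy are illustrative assumptions.

```python
import torch

def content_filter(visual_tokens: torch.Tensor,
                   content_scores: torch.Tensor,
                   keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep only the visual tokens the detector marks as content-bearing.

    visual_tokens:  (batch, num_tokens, dim) patch embeddings from the encoder.
    content_scores: (batch, num_tokens) detector confidence per token.
    keep_ratio:     fraction of tokens to retain (illustrative value).
    """
    num_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
    # Select the highest-scoring tokens, i.e. drop blank background regions.
    idx = content_scores.topk(num_keep, dim=1).indices        # (batch, num_keep)
    idx, _ = idx.sort(dim=1)                                  # keep spatial order
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, idx)                       # (batch, num_keep, dim)
```

Pruning background tokens before they reach the LLM is what keeps high-resolution inputs tractable: the token count the LLM attends over shrinks by whatever fraction the detector discards.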
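Instruction filtering can be sketched in the same spirit, as a relevance test between the instruction and each surviving visual token. Everything below (the single-head scoring, the `threshold`, the module and projection names) is an assumed, illustrative design; the paper describes the actual module.

```python
import torch
import torch.nn as nn

class InstructionFilter(nn.Module):
    """Drop visual tokens that are irrelevant to the current instruction.

    Minimal sketch: a scaled dot-product relevance score between the pooled
    instruction embedding and each remaining visual token.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # instruction query projection
        self.k_proj = nn.Linear(dim, dim)  # visual-token key projection

    def forward(self, visual_tokens: torch.Tensor,
                instruction_emb: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
        # visual_tokens: (1, num_tokens, dim); instruction_emb: (1, dim).
        q = self.q_proj(instruction_emb).unsqueeze(1)        # (1, 1, dim)
        k = self.k_proj(visual_tokens)                       # (1, num_tokens, dim)
        rel = (q * k).sum(-1) / k.size(-1) ** 0.5            # (1, num_tokens)
        keep = rel.sigmoid() > threshold                     # relevance mask
        # Batch size 1 (typical at inference): index the kept tokens directly.
        return visual_tokens[:, keep[0]]
```

Chained after `content_filter`, this leaves the LLM with only the tokens that both carry content and bear on the user's question.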
**Experiments:**
- **Tasks and Datasets:** HRVDA is evaluated on a range of document-oriented datasets, covering information extraction and visual question answering tasks.
- **Implementation Details:** The model uses SwinL as the image encoder and LLaMA-2-7B as the LLM.
- **Comparisons and Ablation Studies:** HRVDA outperforms existing models on multiple datasets and reduces inference latency.
- **Qualitative Analysis:** HRVDA recognizes text reliably and handles complex images well, but still struggles with highly dense text and extreme image aspect ratios.

**Conclusion:** HRVDA is a novel OCR-free multimodal large language model that processes high-resolution images directly and enhances document understanding. It achieves state-of-the-art performance while significantly improving efficiency over previous models.