**Abstract:**
This paper introduces HRVDA (High-Resolution Visual Document Assistant), a novel multimodal large language model (MLLM) designed to enhance visual document understanding. The model addresses the limitations of current MLLMs in handling high-resolution images and document-oriented instructions. HRVDA employs a content filtering mechanism and an instruction filtering module to efficiently process high-resolution images by pruning content-agnostic and instruction-agnostic visual tokens. Additionally, a document-oriented visual instruction tuning dataset is constructed, and a multi-stage training strategy is applied to improve the model's document modeling capabilities. Extensive experiments demonstrate that HRVDA achieves state-of-the-art performance on multiple document understanding datasets while maintaining efficient training and inference speeds.
**Introduction:**
Multimodal large language models have shown significant progress toward general artificial intelligence, but their performance in visual document understanding remains limited by low-resolution image inputs and the lack of document-oriented visual instruction tuning. HRVDA aims to bridge this gap by directly processing high-resolution images and enhancing document understanding capabilities.
**Related Work:**
The paper reviews existing methods in visual document understanding, including OCR-dependent and OCR-free approaches, and discusses multimodal large language models and token pruning techniques.
**HRVDA:**
- **Model Architecture:** HRVDA consists of a content detector, an image encoder, an instruction filtering module, and an LLM.
- **Content Filtering:** A content detector identifies tokens containing significant content, and a content filtering mechanism prunes content-agnostic tokens.
- **Instruction Filtering:** An instruction filtering module further filters out instruction-agnostic tokens.
- **Visual Instruction Tuning:** A diverse dataset of document tasks is used for tuning, and ChatGPT generates diverse instruction templates.
- **Training Strategy:** A multi-stage training approach is employed to fine-tune the model, focusing on content detection, image encoding, and instruction filtering.
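The two filtering stages above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: the thresholds, score names, and the `prune_tokens` helper are illustrative assumptions; in HRVDA the scores come from a trained content detector and an instruction filtering module operating on visual token embeddings.

```python
# Hypothetical sketch of two-stage visual token pruning (illustrative only).
# Stage 1: a content score marks tokens that carry document content
#          (text, tables, figures) vs. blank background.
# Stage 2: an instruction-relevance score keeps only tokens that matter
#          for the current instruction.

def prune_tokens(tokens, content_scores, instr_scores,
                 content_thresh=0.5, instr_thresh=0.5):
    """Keep tokens whose content AND instruction scores pass both thresholds."""
    return [
        tok for tok, c, i in zip(tokens, content_scores, instr_scores)
        if c >= content_thresh and i >= instr_thresh
    ]

# Example: 5 visual tokens; t0/t4 are blank background, t2 is off-instruction.
tokens  = ["t0", "t1", "t2", "t3", "t4"]
content = [0.1, 0.9, 0.8, 0.7, 0.2]
instr   = [0.9, 0.8, 0.3, 0.6, 0.9]
print(prune_tokens(tokens, content, instr))  # -> ['t1', 't3']
```

Pruning before the LLM is what makes high-resolution input tractable: the quadratic attention cost is paid only over the surviving tokens.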
**Experiments:**
- **Tasks and Datasets:** HRVDA is evaluated on various document-oriented datasets, including information extraction and visual question answering tasks.
- **Implementation Details:** The model uses Swin-L as the image encoder and LLaMA-2-7B as the LLM.
- **Comparisons and Ablation Studies:** HRVDA outperforms existing models on multiple datasets and achieves lower inference latency.
- **Qualitative Analysis:** HRVDA shows strong performance in recognizing text and handling complex images, but faces challenges with highly dense text and extreme image proportions.
**Conclusion:**
HRVDA is a novel OCR-free multimodal large language model that effectively processes high-resolution images and enhances document understanding. It achieves state-of-the-art performance and significantly improves efficiency compared to previous models.