HRVDA: High-Resolution Visual Document Assistant


10 Apr 2024 | Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Linli Xu
HRVDA is a high-resolution visual document assistant built on multimodal large language models (MLLMs) to address the challenges of visual document understanding. It is the first MLLM to directly accept high-resolution image inputs, using a Swin Transformer as its visual encoder. To keep high-resolution processing tractable, HRVDA introduces two pruning mechanisms: a content filtering mechanism that identifies and removes content-agnostic visual tokens, and an instruction filtering module that eliminates instruction-agnostic tokens, reducing computational load and improving efficiency. In addition, a document-oriented visual instruction tuning dataset is constructed to strengthen the model's ability to follow document-specific instructions.

HRVDA achieves state-of-the-art performance on multiple document understanding datasets while keeping training and inference costs comparable to those of low-resolution models. Experimental results show that it outperforms existing OCR-free models on tasks such as information extraction and visual question answering, and that it infers faster than competing models even when processing high-resolution images. In short, by pruning redundant visual tokens and better handling document-specific instructions, HRVDA improves both the accuracy and the efficiency of visual document understanding.
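To make the pruning idea concrete, below is a minimal sketch of content filtering: a lightweight scorer predicts how likely each visual token is to carry document content, and only the top-scoring tokens are passed on to the language model, shrinking its input sequence. This is an illustrative assumption-laden sketch, not HRVDA's actual implementation; all names (ContentFilter, keep_ratio) and the scorer design are hypothetical.

```python
# Hedged sketch of content filtering: prune content-agnostic visual tokens
# before they reach the language model. Module and parameter names
# (ContentFilter, keep_ratio) are hypothetical, not HRVDA's real code.
import torch
import torch.nn as nn

class ContentFilter(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.3):
        super().__init__()
        # Lightweight scorer: estimates how likely each token is to carry
        # document content (text, table lines, etc.).
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) visual tokens from the encoder.
        scores = self.scorer(tokens).squeeze(-1)       # (batch, num_tokens)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        top = scores.topk(k, dim=1).indices            # indices of kept tokens
        idx = top.sort(dim=1).values                   # preserve spatial order
        batch = torch.arange(tokens.shape[0]).unsqueeze(-1)
        return tokens[batch, idx]                      # (batch, k, dim)

# Usage: a high-resolution input through a Swin-style encoder can yield
# thousands of visual tokens; keeping ~30% sharply cuts the sequence length
# the language model must attend over.
filt = ContentFilter(dim=1024, keep_ratio=0.3)
pruned = filt(torch.randn(2, 2304, 1024))
print(pruned.shape)  # torch.Size([2, 691, 1024])
```

The instruction filtering module described above would apply an analogous selection step conditioned on the user's instruction, dropping tokens irrelevant to the query.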