Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

29 Feb 2024 | Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun
This paper addresses the challenge of fine-grained feature collapse in Large Visual-Language Models (LVLMs) for visual document understanding (VDU). The authors propose a novel contrastive learning framework called Document Object COntrastive learning (DoCo), which aims to enhance the visual representation of LVLMs in text-rich scenarios. DoCo leverages an auxiliary multimodal encoder to extract fine-grained features of document objects and aligns them with the visual features generated by the vision encoder of the LVLM. This alignment helps the vision encoder acquire more effective visual cues, thereby improving comprehension of text-rich documents. DoCo is a plug-and-play pre-training technique that can be applied to various LVLMs without increasing computational complexity during inference. Extensive experiments on multiple VDU benchmarks show that LVLMs equipped with DoCo achieve superior performance and narrow the gap between VDU and generic vision-language tasks. The paper also includes a detailed methodology, experimental results, and ablation studies supporting the effectiveness of DoCo.
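To make the alignment idea concrete, below is a minimal sketch of a DoCo-style contrastive objective in PyTorch. The function name, feature shapes, and the assumption that document-object features have already been pooled into one vector per image are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of a DoCo-style contrastive alignment loss (PyTorch).
# Assumption: `vision_feats` are pooled features from the LVLM vision
# encoder and `object_feats` are aggregated document-object features
# from the auxiliary multimodal encoder, in matching image order.
import torch
import torch.nn.functional as F

def doco_contrastive_loss(vision_feats: torch.Tensor,
                          object_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss aligning vision-encoder features (B, D)
    with fine-grained document-object features (B, D)."""
    v = F.normalize(vision_feats, dim=-1)
    o = F.normalize(object_feats, dim=-1)
    logits = v @ o.t() / temperature               # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric contrastive loss: vision -> object and object -> vision.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the loss touches only the pre-training objective, the auxiliary encoder can be dropped at inference time, which is consistent with the paper's claim that DoCo adds no computational cost during inference.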