This paper introduces a novel contrastive learning framework called Document Object COntrastive learning (DoCo) to enhance visual document understanding (VDU) in large visual-language models (LVLMs). The main challenge addressed is the "fine-grained feature collapse" issue, where LVLMs struggle to extract detailed visual features from text-rich documents. DoCo is designed to align document object features with the visual features generated by LVLMs, yielding more effective visual representations for text-rich scenarios.
DoCo employs a contrastive learning approach that leverages a multimodal encoder to extract the visual, layout, and textual features of document objects. These features are then aligned with the visual features produced by the vision encoder of the LVLM. This alignment strengthens the vision encoder's ability to capture fine-grained visual cues, improving the model's comprehension of text-rich documents.
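To make the alignment concrete, here is a minimal sketch of an object-level contrastive objective in the symmetric InfoNCE style commonly used for such alignment; the function name, tensor shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of object-level contrastive alignment (symmetric InfoNCE).
# Shapes, names, and the temperature value are assumptions for illustration.
import torch
import torch.nn.functional as F

def doco_contrastive_loss(obj_feats_mm, obj_feats_vis, temperature=0.07):
    """Align multimodal object features with visual object features.

    obj_feats_mm:  (N, D) features of N document objects from the
                   multimodal encoder (visual + layout + text cues).
    obj_feats_vis: (N, D) features of the same N objects, aggregated
                   from the LVLM vision encoder's patch outputs.
    """
    mm = F.normalize(obj_feats_mm, dim=-1)
    vis = F.normalize(obj_feats_vis, dim=-1)
    logits = mm @ vis.t() / temperature              # (N, N) similarities
    targets = torch.arange(mm.size(0), device=mm.device)
    # Each object's multimodal feature should match its own visual
    # feature (the diagonal) and repel all other objects' features.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Pulling the vision encoder's per-object features toward these richer multimodal targets is what counteracts the collapse of fine-grained features.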
The proposed DoCo is a plug-and-play pre-training method that can be applied to various LVLMs without increasing computational complexity during inference. Experimental results on multiple benchmarks show that LVLMs equipped with DoCo achieve superior performance in VDU tasks, narrowing the gap between VDU and generic vision-language tasks.
The methodology involves two main components: Intra-DoCo and Inter-DoCo. Intra-DoCo learns document object representations within a single image, while Inter-DoCo learns them across different images. Both components contribute to enhancing the visual representations of LVLMs in text-rich scenarios.
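Assuming each training batch contains several document images, the distinction can be sketched by where the contrastive negatives come from, reusing doco_contrastive_loss from above; how the paper actually batches negatives is an assumption here.

```python
# Illustrative split: Intra-DoCo draws negatives from objects of the SAME
# image; Inter-DoCo draws negatives from objects across the whole batch.

def intra_doco_loss(per_image_mm, per_image_vis):
    # per_image_*: lists of (N_i, D) object-feature tensors, one per image.
    # Negatives for an object are the other objects of its own image.
    losses = [doco_contrastive_loss(mm, vis)
              for mm, vis in zip(per_image_mm, per_image_vis)]
    return torch.stack(losses).mean()

def inter_doco_loss(per_image_mm, per_image_vis):
    # Concatenating objects from all images lets objects of OTHER images
    # act as additional negatives.
    mm = torch.cat(per_image_mm, dim=0)
    vis = torch.cat(per_image_vis, dim=0)
    return doco_contrastive_loss(mm, vis)
```

A full pre-training objective would presumably combine the two terms, e.g. as a weighted sum; the weighting is again an assumption.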
The framework also includes an ROI Aggregation module that pools the vision encoder's output into object-level features, so that each document object's visual representation accurately reflects its region of the image. Ablation studies and qualitative analyses demonstrate that DoCo significantly improves the performance of LVLMs in text-rich scenarios by capturing fine-grained visual cues and aligning multimodal features with visual representations.
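One plausible reading of ROI Aggregation is masked pooling of ViT patch features inside each object's bounding box; the grid layout, coordinate convention, and mean pooling below are assumptions, building on the imports above.

```python
# Sketch of ROI aggregation: mean-pool the vision encoder's patch
# embeddings whose grid cells overlap an object's bounding box.

def roi_aggregate(patch_feats, boxes, grid_size, image_size):
    """
    patch_feats: (grid_size * grid_size, D) patch embeddings, row-major.
    boxes:       (N, 4) object boxes as (x0, y0, x1, y1) in pixels.
    Returns:     (N, D) one aggregated visual feature per object.
    """
    d = patch_feats.size(-1)
    grid = patch_feats.view(grid_size, grid_size, d)   # (rows, cols, D)
    scale = grid_size / image_size
    pooled = []
    for x0, y0, x1, y1 in boxes.tolist():
        # Map pixel coordinates to patch indices, covering partial cells.
        c0 = min(int(x0 * scale), grid_size - 1)
        r0 = min(int(y0 * scale), grid_size - 1)
        c1 = min(grid_size, max(c0 + 1, -int(-x1 * scale)))  # ceil
        r1 = min(grid_size, max(r0 + 1, -int(-y1 * scale)))
        pooled.append(grid[r0:r1, c0:c1].mean(dim=(0, 1)))
    return torch.stack(pooled)
```

The resulting per-object visual features are what the contrastive loss sketched earlier aligns with the multimodal encoder's outputs.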
Overall, DoCo addresses the fine-grained feature collapse issue by enhancing the visual representations of LVLMs, leading to improved performance in visual document understanding tasks. The proposed method is effective in capturing detailed visual information from text-rich documents, making it a valuable addition to the field of visual document understanding.