17 Jun 2024 | Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin | David R. Cheriton School of Computer Science, University of Waterloo
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that treats document screenshots as a unified input format, eliminating the need for content extraction and preserving all of a document's information: text, images, and layout. DSE leverages a large vision-language model to encode document screenshots directly into dense representations for retrieval. The method is evaluated on Wiki-SS, a corpus of 1.3 million Wikipedia web page screenshots, where it outperforms traditional text retrieval with BM25 by 17 points in top-1 retrieval accuracy, and on a mixed-modality slide retrieval task, where it outperforms OCR-based text retrieval methods by over 15 points in nDCG@10. DSE is thus effective for diverse document types and has the potential to enhance document retrieval in real-world applications. Model checkpoints, code, and the Wiki-SS collection are available at http://tevatron.ai.

DSE uses a bi-encoder architecture in which a document screenshot and a user text query are each encoded into a dense vector. The vision encoder first maps the screenshot into latent patch representations, and the language-model component of the vision-language model then attends over these to capture more fine-grained information in the final document embedding. Relevance between a query and a document is computed as the cosine similarity of their embeddings.
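To make the scoring step concrete, here is a minimal PyTorch sketch of bi-encoder retrieval with cosine similarity. The tensor shapes and the random embeddings standing in for encoder outputs are illustrative assumptions, not details taken from the released checkpoints.

```python
import torch
import torch.nn.functional as F

def cosine_scores(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Score one query embedding against a batch of document embeddings.

    query_emb: (dim,)        -- embedding of the text query
    doc_embs:  (n_docs, dim) -- embeddings of document screenshots
    Returns:   (n_docs,)     -- cosine similarity per document
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    return d @ q

# Hypothetical usage: random vectors stand in for real encoder outputs.
query_emb = torch.randn(768)
doc_embs = torch.randn(1000, 768)
scores = cosine_scores(query_emb, doc_embs)
top1 = scores.argmax()  # index of the best-matching screenshot
```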
DSE is trained with the InfoNCE contrastive loss and shows superior performance in both text-intensive and mixed-modality retrieval tasks, along with strong zero-shot effectiveness across different query distributions and tasks. A study of the effectiveness-efficiency trade-off shows that increasing the number of crops used for an input image improves retrieval effectiveness but reduces encoding speed. Further analysis indicates that DSE captures information from multiple modalities within a screenshot, encoding document content beyond its text and making it suitable for diverse document types and tasks. The method has limitations, including its reliance on visual data and the need for further exploration of multi-task training and contrastive pretraining. Overall, the work contributes to multi-modal information retrieval a new approach that simplifies the document retrieval pipeline while improving retrieval effectiveness.
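For reference, a minimal sketch of the InfoNCE objective follows. The use of in-batch negatives and the temperature value are common contrastive-training defaults assumed here, not details confirmed by the summary above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_embs: torch.Tensor, d_embs: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    q_embs: (batch, dim) -- query embeddings
    d_embs: (batch, dim) -- each query's positive document embedding;
                            the other rows act as in-batch negatives.
    """
    q = F.normalize(q_embs, dim=-1)
    d = F.normalize(d_embs, dim=-1)
    logits = q @ d.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))    # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# Hypothetical usage with random embeddings standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
loss_value = loss.item()
```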
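The crop trade-off can also be illustrated: tiling a screenshot into more sub-images preserves finer detail but gives the vision encoder proportionally more patch tokens to process, slowing encoding. The simple grid-tiling scheme below is an assumption for illustration, not the exact preprocessing used by the released code.

```python
from PIL import Image

def crop_grid(screenshot: Image.Image, n_crops_per_side: int) -> list[Image.Image]:
    """Split a screenshot into an n x n grid of sub-images.

    More crops preserve finer detail (each crop is typically resized to the
    vision encoder's fixed input resolution), but every crop contributes its
    own patch tokens, so encoding cost grows roughly linearly with crop count.
    """
    w, h = screenshot.size
    cw, ch = w // n_crops_per_side, h // n_crops_per_side
    return [
        screenshot.crop((col * cw, row * ch, (col + 1) * cw, (row + 1) * ch))
        for row in range(n_crops_per_side)
        for col in range(n_crops_per_side)
    ]

# Hypothetical usage: 2x2 = 4 crops vs. 4x4 = 16 crops trades speed for detail.
page = Image.new("RGB", (1024, 1024))  # stands in for a real screenshot
fast = crop_grid(page, 2)
detailed = crop_grid(page, 4)
```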