17 Jun 2024 | Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin | David R. Cheriton School of Computer Science, University of Waterloo
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that treats document screenshots as a unified input format, eliminating the need for content extraction and preserving all of a document's information: text, images, and layout. DSE leverages a large vision-language model to encode document screenshots directly into dense representations for retrieval. The method is evaluated on Wiki-SS, a corpus of 1.3 million Wikipedia web page screenshots, where it outperforms traditional text retrieval with BM25 by 17 points in top-1 retrieval accuracy, and on a mixed-modality slide retrieval task, where it outperforms OCR-based text retrieval methods by over 15 points in nDCG@10. DSE is thus effective for diverse document types and has the potential to enhance document retrieval in real-world applications. Model checkpoints, code, and the Wiki-SS collection are available at http://tevatron.ai.

DSE uses a bi-encoder architecture in which a document screenshot and a user text query are each encoded into a dense vector. The vision encoder first maps the screenshot into latent patch representations, and the language-model component of the vision-language model then attends over these to capture more fine-grained information in the final document embedding. Relevance between a query and a document is computed as the cosine similarity of their embeddings.
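To make the scoring step concrete, here is a minimal PyTorch sketch of bi-encoder retrieval with cosine similarity. The tensor shapes and the random embeddings standing in for encoder outputs are illustrative assumptions, not details taken from the released checkpoints.

```python
import torch
import torch.nn.functional as F

def cosine_scores(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Score one query embedding against a batch of document embeddings.

    query_emb: (dim,)        -- embedding of the text query
    doc_embs:  (n_docs, dim) -- embeddings of document screenshots
    Returns:   (n_docs,)     -- cosine similarity per document
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    return d @ q

# Hypothetical usage: random vectors stand in for real encoder outputs.
query_emb = torch.randn(768)
doc_embs = torch.randn(1000, 768)
scores = cosine_scores(query_emb, doc_embs)
top1 = scores.argmax()  # index of the best-matching screenshot
```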
DSE is trained with the InfoNCE contrastive loss and shows superior performance in both text-intensive and mixed-modality retrieval tasks, along with strong zero-shot effectiveness across different query distributions and tasks. A study of the effectiveness-efficiency trade-off shows that increasing the number of crops used for an input image improves retrieval effectiveness but reduces encoding speed. Further analysis indicates that DSE captures information from multiple modalities within a screenshot, encoding document content beyond its text and making it suitable for diverse document types and tasks. The method has limitations, including its reliance on visual data and the need for further exploration of multi-task training and contrastive pretraining. Overall, the work contributes to multi-modal information retrieval a new approach that simplifies the document retrieval pipeline while improving retrieval effectiveness.
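For reference, a minimal sketch of the InfoNCE objective follows. The use of in-batch negatives and the temperature value are common contrastive-training defaults assumed here, not details confirmed by the summary above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_embs: torch.Tensor, d_embs: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    q_embs: (batch, dim) -- query embeddings
    d_embs: (batch, dim) -- each query's positive document embedding;
                            the other rows act as in-batch negatives.
    """
    q = F.normalize(q_embs, dim=-1)
    d = F.normalize(d_embs, dim=-1)
    logits = q @ d.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))    # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# Hypothetical usage with random embeddings standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
loss_value = loss.item()
```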
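The crop trade-off can also be illustrated: tiling a screenshot into more sub-images preserves finer detail but gives the vision encoder proportionally more patch tokens to process, slowing encoding. The simple grid-tiling scheme below is an assumption for illustration, not the exact preprocessing used by the released code.

```python
from PIL import Image

def crop_grid(screenshot: Image.Image, n_crops_per_side: int) -> list[Image.Image]:
    """Split a screenshot into an n x n grid of sub-images.

    More crops preserve finer detail (each crop is typically resized to the
    vision encoder's fixed input resolution), but every crop contributes its
    own patch tokens, so encoding cost grows roughly linearly with crop count.
    """
    w, h = screenshot.size
    cw, ch = w // n_crops_per_side, h // n_crops_per_side
    return [
        screenshot.crop((col * cw, row * ch, (col + 1) * cw, (row + 1) * ch))
        for row in range(n_crops_per_side)
        for col in range(n_crops_per_side)
    ]

# Hypothetical usage: 2x2 = 4 crops vs. 4x4 = 16 crops trades speed for detail.
page = Image.new("RGB", (1024, 1024))  # stands in for a real screenshot
fast = crop_grid(page, 2)
detailed = crop_grid(page, 4)
```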