Towards VQA Models That Can Read

13 May 2019 | Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach
This paper introduces a new dataset, TextVQA, and a novel model architecture, LoRRA (Look, Read, Reason & Answer), to address the challenge of building Visual Question Answering (VQA) models that can read text in images. Existing VQA models struggle with questions that require reading and reasoning about text, a capability that current deep learning approaches do not support well. TextVQA contains 45,336 questions on 28,408 images, each requiring reasoning about text in the image to answer. The dataset is designed to better reflect the needs of visually impaired users, who frequently ask questions about text in images.

LoRRA explicitly reasons over the outputs of an OCR system when answering questions. It includes components for reading text in the image, reasoning jointly about the text and the visual content, and predicting an answer that is either deduced from the combined evidence or copied directly from the OCR tokens. The copy mechanism lets the model produce answers that are not in its fixed answer vocabulary.

Trained on TextVQA, LoRRA outperforms existing state-of-the-art VQA models on both the TextVQA and VQA 2.0 datasets. The gap between human and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well suited to benchmark progress along directions complementary to VQA 2.0. The paper also reviews related work in VQA and text-based VQA and presents an overview of the LoRRA model architecture.
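To make the copy mechanism concrete, below is a minimal PyTorch-style sketch of a LoRRA-like answer head that scores a fixed answer vocabulary alongside the OCR tokens detected in the image, so a single classifier can either deduce an answer or copy one. The class name, layer choices, and dimensions (fused_dim, ocr_dim, vocab_size) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CopyAugmentedAnswerHead(nn.Module):
    """Sketch of a copy-augmented VQA answer head (illustrative, not the
    paper's exact implementation): score a fixed answer vocabulary plus
    each OCR token found in the image."""

    def __init__(self, fused_dim: int, ocr_dim: int, vocab_size: int):
        super().__init__()
        # Scores over the fixed answer vocabulary ("deduced" answers).
        self.vocab_head = nn.Linear(fused_dim, vocab_size)
        # Projects OCR token embeddings so they can be scored against the
        # fused question+image state ("copied" answers).
        self.copy_proj = nn.Linear(ocr_dim, fused_dim)

    def forward(self, fused: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # fused:     (batch, fused_dim)        joint question+image representation
        # ocr_feats: (batch, num_ocr, ocr_dim) embeddings of detected OCR tokens
        vocab_logits = self.vocab_head(fused)                          # (batch, vocab_size)
        keys = self.copy_proj(ocr_feats)                               # (batch, num_ocr, fused_dim)
        copy_logits = torch.bmm(keys, fused.unsqueeze(2)).squeeze(2)   # (batch, num_ocr)
        # Concatenating both score blocks lets one classifier pick either a
        # vocabulary answer or one of the OCR tokens.
        return torch.cat([vocab_logits, copy_logits], dim=1)

# Usage with toy shapes (all sizes are placeholders):
head = CopyAugmentedAnswerHead(fused_dim=2048, ocr_dim=300, vocab_size=3000)
fused = torch.randn(4, 2048)        # fused question+image features
ocr = torch.randn(4, 50, 300)       # embeddings for up to 50 OCR tokens
logits = head(fused, ocr)           # shape: (4, 3000 + 50)
```

In such a setup, the concatenated logits can be supervised with a multi-label loss as in standard VQA classifiers, and at inference the argmax index maps back either to a vocabulary answer or to the string of the corresponding OCR token.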
The paper concludes that LoRRA significantly outperforms existing VQA models on TextVQA and highlights the importance of incorporating OCR and reasoning capabilities into VQA models to better handle text-based questions.