Towards VQA Models That Can Read

13 May 2019 | Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach
This paper addresses the challenge of enabling Visual Question Answering (VQA) models to read and reason about text in images, a capability that is crucial for assisting visually impaired users. The authors introduce a new dataset called TextVQA, which contains 45,336 questions on 28,408 images, focusing on scenarios where reading and reasoning about text are essential. They propose a novel model architecture called Look, Read, Reason & Answer (LoRRA), which integrates Optical Character Recognition (OCR) to read and process text in images. LoRRA can predict answers from a fixed vocabulary or directly from the detected text. The model outperforms existing state-of-the-art VQA models on the TextVQA dataset, highlighting the significant gap between human and machine performance in this domain. The paper also discusses the limitations of current VQA models and suggests future directions for improving text detection and reasoning in unconstrained environments.
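As a rough illustration of the two answer sources described above (a fixed answer vocabulary plus copying a detected OCR token), the following Python sketch shows how a classifier's output space can be extended with one copy slot per OCR token. All names, dimensions, and the fusion step are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LoRRAAnswerSketch(nn.Module):
    """Minimal sketch: the answer space is the fixed vocabulary plus one
    "copy" slot per detected OCR token. Hypothetical simplification of the
    LoRRA answer module, not the authors' implementation."""

    def __init__(self, feat_dim: int, vocab_size: int, max_ocr_tokens: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.vocab_head = nn.Linear(feat_dim, vocab_size)     # scores over fixed-vocabulary answers
        self.copy_head = nn.Linear(feat_dim, max_ocr_tokens)  # scores over OCR tokens (copy mechanism)

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (batch, feat_dim) -- assumed to already combine
        # question, image-region, and OCR-token features (e.g., via attention).
        vocab_logits = self.vocab_head(fused_features)
        copy_logits = self.copy_head(fused_features)
        # Indices >= vocab_size refer to OCR tokens to be copied verbatim as the answer.
        return torch.cat([vocab_logits, copy_logits], dim=1)

# Usage sketch with illustrative sizes (not the paper's actual hyperparameters):
model = LoRRAAnswerSketch(feat_dim=512, vocab_size=4000, max_ocr_tokens=50)
fused = torch.randn(2, 512)
logits = model(fused)
pred = logits.argmax(dim=1)  # index < 4000 -> vocabulary answer, otherwise copy the OCR token
```

The key design point is that the output dimension grows with the number of OCR tokens, so the model can produce answers (such as brand names or numbers read from the image) that never appear in the training vocabulary.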