1 Aug 2017 | Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra
We introduce the Visual Dialog task, which requires an AI agent to engage in meaningful dialog with humans in natural language about visual content. Given an image, a dialog history, and a question about the image, the agent must ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is a general test of machine intelligence, grounded in vision to allow objective evaluation of responses. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial), which contains 1 dialog of 10 question-answer pairs on ~120k images from COCO, for a total of ~1.2M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network) and 2 decoders (generative and discriminative), which outperform baselines. We propose a retrieval-based evaluation protocol in which the AI agent is asked to sort a list of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. We demonstrate the first 'visual chatbot' and provide our dataset, code, trained models, and chatbot at https://visualdialog.org.

The VisDial dataset contains 1 dialog (10 QA pairs) on ~123k images from COCO-train/val, for a total of 1,232,870 QA pairs. VisDial questions are longer and more descriptive than those in previous datasets, and the answers have a significant heavy tail in diversity, including expressions of doubt, uncertainty, or lack of information. Dialogs contain co-references, with pronouns used in 38% of questions, 19% of answers, and nearly all dialogs. VisDial exhibits smoothness and continuity in topics, with questions following a pattern of open-ended exploration, and it has the statistics of an NLP dialog dataset: language models trained on VisDial show lower perplexity than those trained on VQA and the Cornell Movie-Dialogs Corpus.

Our results show that models that better encode the dialog history (MN and HRE) perform better than the corresponding LF models. We also conduct human studies to evaluate human performance on this task for all combinations of {with image, without image} × {with history, without history}. Our results indicate significant scope for improvement in Visual Dialog.
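To make the retrieval-based evaluation concrete, here is a minimal sketch of how the mean reciprocal rank and recall@k of the human response can be computed once a model has scored each candidate answer. The toy 5-candidate data, function names, and data layout below are illustrative assumptions; VisDial itself uses 100 candidate answers per question.

```python
# Minimal sketch of the retrieval-based evaluation: for each question the model
# scores a fixed list of candidate answers; candidates are sorted by score and
# we report the mean reciprocal rank (MRR) and recall@k of the human response.
from typing import List, Sequence


def rank_of_human_answer(scores: Sequence[float], human_index: int) -> int:
    """1-based rank of the human answer after sorting candidates by score (descending)."""
    human_score = scores[human_index]
    # Every candidate scored strictly higher than the human answer pushes its rank down.
    return 1 + sum(1 for s in scores if s > human_score)


def mean_reciprocal_rank(all_scores: List[Sequence[float]], human_indices: Sequence[int]) -> float:
    ranks = [rank_of_human_answer(s, i) for s, i in zip(all_scores, human_indices)]
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(all_scores: List[Sequence[float]], human_indices: Sequence[int], k: int) -> float:
    ranks = [rank_of_human_answer(s, i) for s, i in zip(all_scores, human_indices)]
    return sum(r <= k for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Two toy questions, each with 5 candidate answers (VisDial uses 100 per question).
    scores = [[0.1, 0.9, 0.3, 0.2, 0.05], [0.4, 0.1, 0.2, 0.8, 0.6]]
    human = [1, 4]  # index of the human response among the candidates
    print("MRR:", mean_reciprocal_rank(scores, human))  # (1/1 + 1/2) / 2 = 0.75
    print("R@1:", recall_at_k(scores, human, k=1))      # 0.5
```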
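As an illustration of the encoder family, below is a minimal PyTorch sketch of a Late Fusion-style encoder: precomputed CNN image features, an LSTM over the question, and an LSTM over the concatenated dialog history are fused by a single fully-connected layer into a joint embedding that a generative or discriminative decoder could consume. All dimensions, layer choices, and the use of precomputed image features are simplifying assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of a Late Fusion (LF) encoder: encode question and concatenated history
# with separate LSTMs, concatenate them with CNN image features, and project to
# a joint embedding. Sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 rnn_dim: int = 512, img_dim: int = 4096, joint_dim: int = 512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.question_rnn = nn.LSTM(embed_dim, rnn_dim, batch_first=True)
        self.history_rnn = nn.LSTM(embed_dim, rnn_dim, batch_first=True)
        # Fuse [image ; question ; history] into one joint embedding.
        self.fusion = nn.Linear(img_dim + 2 * rnn_dim, joint_dim)

    def forward(self, img_feat, question_tokens, history_tokens):
        # img_feat: (B, img_dim) precomputed CNN features
        # question_tokens / history_tokens: (B, T) padded word-id sequences
        _, (q_state, _) = self.question_rnn(self.word_embed(question_tokens))
        _, (h_state, _) = self.history_rnn(self.word_embed(history_tokens))
        fused = torch.cat([img_feat, q_state[-1], h_state[-1]], dim=1)
        return torch.tanh(self.fusion(fused))


if __name__ == "__main__":
    enc = LateFusionEncoder(vocab_size=10000)
    img = torch.randn(2, 4096)                    # batch of 2 image feature vectors
    q = torch.randint(1, 10000, (2, 12))          # question word ids
    h = torch.randint(1, 10000, (2, 80))          # concatenated-history word ids
    print(enc(img, q, h).shape)                   # torch.Size([2, 512])
```

A generative decoder would condition an answer-generating LSTM on this joint embedding, while a discriminative decoder would score each candidate answer against it, which is what the retrieval metrics above evaluate.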