1 Aug 2017 | Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra
The paper introduces the task of Visual Dialog, in which an AI agent holds a meaningful conversation with a human about visual content. The agent must ground its responses in the image, infer context from the dialog history, and answer questions accurately. To support this task, the authors build a large-scale dataset called VisDial, containing one dialog with 10 question-answer pairs per image on approximately 120,000 images from the COCO dataset. They propose a retrieval-based evaluation protocol in which the agent ranks a list of candidate answers and is scored on metrics such as the mean reciprocal rank of the human response.
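The mean-reciprocal-rank metric used in this retrieval protocol can be sketched in a few lines. This is an illustrative implementation, not the authors' evaluation code; the function name and input format are assumptions.

```python
def mean_reciprocal_rank(ranked_candidates, human_answers):
    """Mean reciprocal rank of the human answer across questions.

    ranked_candidates: one list of candidate answers per question,
    sorted best-first by the model.
    human_answers: the ground-truth human answer for each question.
    """
    total = 0.0
    for candidates, answer in zip(ranked_candidates, human_answers):
        rank = candidates.index(answer) + 1  # 1-based rank of the human answer
        total += 1.0 / rank
    return total / len(human_answers)

# Two questions: human answer ranked 1st and 4th -> (1/1 + 1/4) / 2 = 0.625
mrr = mean_reciprocal_rank(
    [["yes", "no", "maybe"], ["two", "one", "three", "four"]],
    ["yes", "four"],
)
```

Ranking-based scoring like this sidesteps the difficulty of judging free-form generated answers: the model only has to sort a fixed candidate list, so the metric is automatic and reproducible.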
The paper also introduces a family of neural encoder-decoder models, pairing Late Fusion, Hierarchical Recurrent Encoder, and Memory Network encoders with generative and discriminative decoders; these models outperform several sophisticated baselines. The authors conduct human studies to quantify the gap between machine and human performance on the task. Overall, the paper demonstrates the first visual chatbot and releases the dataset, code, and trained models at <https://visualdialog.org>.
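The core idea of the Late Fusion encoder can be sketched with NumPy: encode the image, question, and dialog history separately, then concatenate and project into a joint embedding. The feature dimensions and random vectors below are placeholders standing in for real CNN and LSTM encoders, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the paper's exact dimensions may differ.
IMG_DIM, QUES_DIM, HIST_DIM, JOINT_DIM = 4096, 512, 512, 512

# Stand-ins for real encoder outputs (CNN image features, LSTM text states).
img_feat = rng.standard_normal(IMG_DIM)
ques_feat = rng.standard_normal(QUES_DIM)
hist_feat = rng.standard_normal(HIST_DIM)

# Late fusion: concatenate the three encodings, then project them
# through a learned linear layer with a tanh non-linearity.
W = rng.standard_normal((JOINT_DIM, IMG_DIM + QUES_DIM + HIST_DIM)) * 0.01
fused = np.tanh(W @ np.concatenate([img_feat, ques_feat, hist_feat]))
# `fused` is the joint embedding handed to a generative or
# discriminative decoder to produce or rank answers.
```

Fusing only at the end keeps each modality's encoder simple; the Hierarchical Recurrent Encoder and Memory Network variants instead interact with the dialog history in more structured ways.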