Submitted 03/13; published 08/13 | Micah Hodosh, Peter Young, Julia Hockenmaier
The paper proposes a new benchmark for sentence-based image description and search, consisting of 8,000 images, each paired with five different captions, and introduces systems that perform well on this task even with minimal supervision. The study emphasizes the importance of training on multiple captions per image and of capturing both syntactic and semantic features of those captions. Comparing evaluation metrics, the authors show that metrics which consider the full ranked list of results for each query image or sentence are more robust than those based on a single response per query. The paper also motivates the need for a new dataset, discusses the shortcomings of existing datasets, and proposes a crowdsourcing approach to collecting high-quality captions. A further analysis of the metrics suggests that automated evaluation of ranking-based systems may be feasible.
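The ranked-list metrics discussed above can be made concrete with a small sketch. This is not the authors' code; it assumes a hypothetical score matrix `scores`, where `scores[i][j]` is a model's score for candidate caption `j` against query image `i`, and the correct caption for image `i` is caption `i`. It computes Recall@k and the median rank of the correct caption, two standard ranking measures of the kind the paper favors over single-response metrics:

```python
def ranks_of_correct(scores):
    """Return the 1-based rank of the correct item for each query.

    scores[i][j] is the score of candidate j for query i; the correct
    candidate for query i is assumed to be candidate i (a toy setup
    for illustration only).
    """
    ranks = []
    for i, row in enumerate(scores):
        # Sort candidate indices by descending score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranks.append(order.index(i) + 1)
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median of the correct items' ranks (lower is better)."""
    s = sorted(ranks)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
```

Because these metrics summarize the whole ranked list rather than only the top response, a system is credited for placing the correct caption near the top even when it is not ranked first, which is one reason such metrics are less brittle per query.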