2013 | Micah Hodosh, Peter Young, Julia Hockenmaier
This paper introduces a new benchmark for sentence-based image description and search, consisting of 8,000 images, each paired with five captions describing its content. The authors argue that framing image description as a ranking task, in which systems rank candidate sentences for an image (and images for a sentence), allows a more direct and reliable assessment than open-ended caption generation. They propose ranking-based evaluation metrics, which are more robust than single-response metrics, and show that these metrics correlate well with human judgments. The paper also presents two image description systems: one based on nearest-neighbor search and one based on Kernel Canonical Correlation Analysis (KCCA), each combining image and text kernels that capture visual and linguistic features. Evaluated on the new 8,000-image dataset, the ranking-based approach outperforms traditional caption generation methods. Comparing human and automatic evaluation metrics, the authors find that human judgments remain the more reliable measure of description quality. The paper concludes that ranking-based image description systems can be evaluated automatically, and that the proposed benchmark provides a valuable resource for the community.
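To make the ranking-based evaluation concrete, the sketch below computes recall@k and median rank of the correct item from a query-by-candidate similarity matrix, which is the general form such metrics take; the function name, the matrix convention (the correct candidate for query i sits at index i), and the toy data are assumptions for illustration, not the authors' code.

```python
import numpy as np

def ranking_metrics(similarity, ks=(1, 5, 10)):
    """Recall@k and median rank from a query-by-candidate similarity matrix.

    similarity[i, j] is the score of candidate j for query i; the correct
    candidate for query i is assumed to be at index i (a common convention
    when each image is paired with its own captions, and vice versa).
    """
    n = similarity.shape[0]
    order = np.argsort(-similarity, axis=1)  # candidates sorted by descending score
    # 1-based rank of the correct candidate for each query
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])

    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    return metrics

# Toy usage: 4 queries (e.g. captions) scored against 4 images.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(4, 4))
    sim[np.arange(4), np.arange(4)] += 2.0  # make the true pairs score higher
    print(ranking_metrics(sim))
```

A nearest-neighbor or KCCA-based system plugs into this scheme by producing the similarity matrix (e.g. similarities between images and sentences projected into a shared space), after which recall@k and median rank can be computed automatically.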