Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg
The paper presents a system that automatically generates natural language descriptions of images, drawing on both statistical models derived from large text corpora and computer vision algorithms. The generated sentences are relevant and more specific to the image content than those of previous methods. A Conditional Random Field (CRF) predicts the best labeling of the image content, covering objects, attributes, and spatial relationships, by combining unary image potentials with higher-order text-based potentials; sentences are then generated from the predicted labeling. Human evaluation supports the approach, with high scores for both the quality of the generated sentences and the accuracy of the described image content. The key contributions are the automatic mining of visually descriptive language and the integration of advanced computer vision techniques.
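The labeling step summarized above can be illustrated with a toy sketch. It is not the paper's implementation: the labels, the scores, the `text_weight` parameter, the sentence template, and all function names are hypothetical. It also simplifies the higher-order text-based potentials to a single pairwise attribute-object term and replaces CRF inference with brute-force search over a tiny label space.

```python
from itertools import product

# Hypothetical unary image potentials (made-up detector/classifier scores).
object_scores = {"dog": 0.6, "sheep": 0.5}
attribute_scores = {"furry": 0.4, "woolly": 0.7}

# Hypothetical text-based potential: how well an attribute fits an object,
# standing in for statistics mined from a large text corpus.
text_scores = {
    ("furry", "dog"): 0.8,
    ("woolly", "dog"): 0.1,
    ("furry", "sheep"): 0.3,
    ("woolly", "sheep"): 0.9,
}


def labeling_score(obj, attr, text_weight=1.0):
    """Sum the unary image terms and the weighted text-based term."""
    unary = object_scores[obj] + attribute_scores[attr]
    return unary + text_weight * text_scores[(attr, obj)]


def best_labeling(text_weight=1.0):
    """Return the highest-scoring (object, attribute) pair by exhaustive search."""
    candidates = product(object_scores, attribute_scores)
    return max(candidates, key=lambda pair: labeling_score(pair[0], pair[1], text_weight))


def describe(obj, attr):
    """Fill a fixed (hypothetical) template from the chosen labels."""
    return f"This is a picture of a {attr} {obj}."


if __name__ == "__main__":
    obj, attr = best_labeling()
    print(describe(obj, attr))
```

With these made-up numbers, the image terms alone (`text_weight=0.0`) pick the inconsistent pair "woolly dog", while adding the text-based term shifts the choice to "woolly sheep". This is the kind of correction that combining image and text potentials is meant to provide.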