| Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg
This paper presents a system that automatically generates natural language descriptions of images by combining statistical models mined from text data with computer vision techniques. The system detects objects, their modifiers (attributes), and the spatial relationships between objects in an image, then uses these detections to generate sentences. A Conditional Random Field (CRF) models the relationship between image content and sentence generation, incorporating both image-based potentials and text-based potentials derived from large text corpora. Sentences are produced with a template-based generation approach, which yields high-quality descriptions. The system is evaluated with both automatic metrics and human judgments; human evaluations show that the generated sentences are of high quality and more specific to the image content than those of previous automated methods. The results demonstrate that automatically mining and parsing large text collections provides valuable statistical models of visually descriptive language, and that combining these models with state-of-the-art vision systems produces relevant, accurate image descriptions.
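The template-based generation step described above can be sketched as follows. This is a minimal illustrative assumption, not the paper's exact implementation: it assumes detections arrive as (modifier, object) pairs plus a spatial preposition, and fills a single fixed sentence template with them.

```python
def describe(adj1, obj1, prep, adj2, obj2):
    """Fill a fixed sentence template with detected labels.

    adj1/obj1 and adj2/obj2 are a detected (modifier, object) pair each;
    prep is the detected spatial relationship between the two objects.
    All names and the template itself are hypothetical.
    """
    # Choose the indefinite article from the modifier's first letter.
    article = lambda w: "an" if w[0] in "aeiou" else "a"
    np1 = f"{article(adj1)} {adj1} {obj1}"
    np2 = f"{article(adj2)} {adj2} {obj2}"
    return f"There is {np1} {prep} {np2}."

print(describe("brown", "dog", "near", "green", "tree"))
# -> "There is a brown dog near a green tree."
```

In the paper's pipeline, the labels filling such a template would come from the CRF's most likely labeling of the image, so the text-based potentials steer the output toward word combinations that are common in descriptive language.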