2010 | Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
The paper "Every Picture Tells a Story: Generating Sentences from Images" by Ali Farhadi et al. presents a system that can generate descriptive sentences from images and vice versa. The authors introduce an intermediate space of meanings, represented as triplets of $(object, action, scene)$, to map images and sentences to a common space. This mapping is learned using discriminative procedures and evaluated on a novel dataset of human-annotated images. The system uses a combination of detectors, classifiers, and distributional semantics to handle out-of-vocabulary words and synecdoche. The evaluation includes quantitative measures such as Tree-F1 and BLUE scores, showing that the system can produce accurate and concise sentences that capture the essence of the images. The paper also discusses the challenges and future directions in this field, emphasizing the importance of iterative refinement in generating more complex sentences.The paper "Every Picture Tells a Story: Generating Sentences from Images" by Ali Farhadi et al. presents a system that can generate descriptive sentences from images and vice versa. The authors introduce an intermediate space of meanings, represented as triplets of $(object, action, scene)$, to map images and sentences to a common space. This mapping is learned using discriminative procedures and evaluated on a novel dataset of human-annotated images. The system uses a combination of detectors, classifiers, and distributional semantics to handle out-of-vocabulary words and synecdoche. The evaluation includes quantitative measures such as Tree-F1 and BLUE scores, showing that the system can produce accurate and concise sentences that capture the essence of the images. The paper also discusses the challenges and future directions in this field, emphasizing the importance of iterative refinement in generating more complex sentences.