Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

28 Feb 2024 | Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura
This paper introduces Polos, a supervised automatic evaluation metric for image captioning models. The proposed metric leverages multimodal inputs and human feedback to compute scores that closely align with human judgments. Polos uses a parallel feature extraction mechanism that combines text embeddings trained with SimCSE and vision-language embeddings from CLIP. To train Polos, the authors introduce Multimodal Metric Learning from Human Feedback (M²LHF), a framework that enables the development of a practical supervised metric for image captioning.

The Polaris dataset, which contains 131,020 human judgments from 550 evaluators, is constructed to provide a diverse range of captions and evaluations. Polos achieves state-of-the-art performance on several image captioning benchmarks, including Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset. The proposed metric outperforms existing metrics in terms of correlation with human judgments and handles hallucinations more effectively. The method is robust and practical, making it a strong automatic evaluation metric for image captioning. The paper also discusses the limitations of the proposed method, including its tendency to overestimate captions that lack intricate details. The authors believe that this study represents a significant step toward the development of a more practical metric for image captioning models.
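To make the parallel feature extraction idea concrete, here is a minimal sketch of how caption and image embeddings might be fused and mapped to a scalar quality score. It is not the actual Polos implementation: the random vectors stand in for SimCSE and CLIP embeddings, the difference/product fusion is a common pairing scheme that the real model may not use, and the untrained linear head merely illustrates where supervised learning from human judgments would fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(text_emb: np.ndarray, img_emb: np.ndarray) -> np.ndarray:
    """Fuse a caption embedding with an image embedding into one feature
    vector using element-wise difference and product features (a common
    pairing scheme; the actual Polos fusion may differ)."""
    return np.concatenate([
        text_emb,
        img_emb,
        np.abs(text_emb - img_emb),  # element-wise distance features
        text_emb * img_emb,          # element-wise interaction features
    ])

# Stand-ins for SimCSE (text) and CLIP (vision-language) embeddings;
# real models would produce these from the caption and the image.
d = 8
text_emb = rng.normal(size=d)
img_emb = rng.normal(size=d)

features = extract_features(text_emb, img_emb)  # shape: (4 * d,)

# An untrained linear head mapping features to a scalar score; in a
# supervised metric, its weights would be fit to human judgments.
w = rng.normal(size=features.shape[0])
score = 1.0 / (1.0 + np.exp(-(features @ w)))  # sigmoid keeps score in (0, 1)
print(features.shape, float(score))
```

The sigmoid at the end simply constrains the output to a bounded range, mirroring how human judgment scores are typically normalized before regression.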