Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

28 Feb 2024 | Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura
**Abstract:** Establishing an automatic evaluation metric that closely aligns with human judgments is crucial for advancing image captioning models. Recent data-driven metrics have shown stronger correlations with human judgments than classic metrics such as CIDEr, but they struggle with hallucinations and generalize poorly across diverse images and texts. This study introduces Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, the authors introduce Multimodal Metric Learning from Human Feedback (M²LHF), a framework for developing metrics based on human feedback. The Polaris dataset, comprising 131K human judgments from 550 evaluators, is constructed to enhance the robustness of Polos. The proposed approach achieves state-of-the-art performance on various benchmarks, demonstrating its effectiveness and robustness.

**Introduction:** Image captioning has wide-ranging practical applications, from assisting visually impaired individuals to facilitating dialog about images. Establishing an automatic evaluation metric that aligns with human judgments is therefore essential for advancing image captioning models. Classic metrics such as CIDEr correlate only weakly with human judgments, and although data-driven metrics perform better, they often fail to handle hallucinations and to generalize across diverse images and texts because they rely on scalar similarities computed from embeddings learned on unrelated tasks.

**Methodology:** The proposed metric, Polos, integrates similarity-based and learning-based approaches, modeling intricate relationships within text-image pairs and text-text pairs. It uses a parallel feature extraction mechanism that combines CLIP and RoBERTa embeddings. The M²LHF framework is introduced to develop practical supervised metrics directly from human feedback. The Polaris dataset, containing 131K human judgments, is constructed to broaden the diversity and range of evaluations.
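To make the parallel feature extraction idea concrete, here is a minimal sketch in its spirit, not the authors' implementation: the checkpoint names, the fusion scheme (Hadamard product plus absolute difference), and the MLP regression head are illustrative assumptions.

```python
# Minimal sketch of a parallel feature extraction metric in the spirit of
# Polos. Checkpoints, fusion scheme, and head sizes are illustrative
# assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn
from transformers import CLIPModel, RobertaModel

class ParallelFeatureMetric(nn.Module):
    def __init__(self):
        super().__init__()
        # Two encoders in parallel: CLIP relates the candidate caption to
        # the image; RoBERTa relates the candidate to a reference caption.
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        fused = 2 * (self.clip.config.projection_dim + self.roberta.config.hidden_size)
        # MLP regression head producing a scalar quality score in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(fused, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    @staticmethod
    def fuse(a, b):
        # Hadamard product captures agreement between two embeddings;
        # absolute difference captures their divergence.
        return torch.cat([a * b, (a - b).abs()], dim=-1)

    def forward(self, pixel_values, clip_cand, rob_cand, rob_ref):
        # Image-text branch (CLIP), L2-normalized as in contrastive training.
        img = nn.functional.normalize(
            self.clip.get_image_features(pixel_values=pixel_values), dim=-1)
        txt = nn.functional.normalize(
            self.clip.get_text_features(**clip_cand), dim=-1)
        # Text-text branch (RoBERTa), using the <s> token as the sentence embedding.
        cand = self.roberta(**rob_cand).last_hidden_state[:, 0]
        ref = self.roberta(**rob_ref).last_hidden_state[:, 0]
        feats = torch.cat([self.fuse(img, txt), self.fuse(cand, ref)], dim=-1)
        return self.head(feats).squeeze(-1)
```

Under M²LHF, a model of this shape would be trained by regressing its outputs onto the human judgments in Polaris (for example with an MSE loss), with scores against multiple references aggregated, e.g., by taking the maximum.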
**Experimental Evaluation:** Polos is evaluated on various benchmarks, including Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset. The results show that Polos outperforms existing metrics in terms of correlation with human judgments and zero-shot performance. Ablation studies validate the effectiveness of the parallel feature extraction mechanism and the M²LHF framework.
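As an illustration of this meta-evaluation protocol, the sketch below correlates a metric's scores with human judgments. Kendall's tau is the coefficient commonly reported on the Flickr8K benchmarks; the scores here are a toy example, not results from the paper.

```python
# Minimal sketch of metric meta-evaluation: correlate automatic scores with
# human judgments for the same candidate captions. The data is invented.
from scipy.stats import kendalltau

metric_scores = [0.91, 0.34, 0.77, 0.12, 0.58]  # scores from the metric
human_scores = [5, 2, 4, 1, 3]                  # e.g., 1-5 human ratings

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")  # tau = 1.0 here
```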
**Conclusion:** This paper introduces Polos, a supervised automatic evaluation metric for image captioning. It contributes to the development of a practical metric through M²LHF, a parallel feature extraction mechanism, and the construction of the Polaris dataset. Polos achieves state-of-the-art performance on multiple benchmarks, demonstrating its effectiveness and robustness.