2024 | Zhongtian Fu, Kefei Song, Luping Zhou, Yang Yang
Noise-Aware Image Captioning with Progressively Exploring Mismatched Words
This paper proposes a Noise-Aware Image Captioning (NIC) method to address the challenge of noisy image-text pairs in image captioning. Unlike traditional noisy-label learning, NIC identifies mismatched words at a fine-grained level so that the trustworthy portion of each caption can still be exploited. It adaptively mitigates the impact of noise by progressively exploring mismatched words and constructing pseudo-word-labels based on word-label reliability and model convergence.

Word-label reliability combines two signals: inter-modal representativeness and intra-modal informativeness. Inter-modal representativeness measures the significance of the current word by assessing cross-modal correlation via prediction certainty, while intra-modal informativeness amplifies the effect of the current prediction by incorporating the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels to coordinate the mismatched words, improving both robustness and accuracy.

NIC is validated on the MS-COCO and Conceptual Captions datasets across various noisy scenarios. It improves performance in all noisy settings, with gains of 6.2/5.3 in CIDEr and 1.8/1.5 in SPICE over the PureT baseline. Its noise robustness hinges on accurately identifying noisy words, which is validated through experiments at different noise levels and with different captioning models; NIC outperforms existing state-of-the-art image captioning methods on all metrics, indicating strong robustness in realistic settings. The method is model-agnostic and can be applied to any existing image captioning model. The overall time complexity is O(M(M²d_model + Md_model² + |Z|O)), where M is the number of visual regions, |Z| is the vocabulary size, and d_model is the model's hidden dimension.
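The reliability scoring and pseudo-word-label construction described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the function names, the use of the mean label probability of subsequent words as the intra-modal term, and the fixed threshold `tau` standing in for the paper's convergence-based schedule are all assumptions made here.

```python
import numpy as np

def word_reliability(probs, labels):
    """Toy per-word reliability for a caption of length T.

    probs  : (T, V) softmax distributions from the captioning model
    labels : (T,)   ground-truth word indices (possibly noisy)
    """
    T = len(labels)
    # Inter-modal representativeness: the certainty the model assigns to the
    # labelled word given the image (high probability => likely matched).
    inter = probs[np.arange(T), labels]
    # Intra-modal informativeness: quality of the *subsequent* generation,
    # approximated here by the mean label probability of the following words.
    intra = np.array([
        probs[np.arange(t + 1, T), labels[t + 1:]].mean() if t + 1 < T else 1.0
        for t in range(T)
    ])
    return inter * intra  # in (0, 1]; low values suggest mismatched words

def pseudo_word_labels(probs, labels, reliability, tau=0.5):
    """Soften suspected mismatched labels into a reliability-weighted mix of
    the original one-hot label and the model's current prediction."""
    T, V = probs.shape
    one_hot = np.eye(V)[labels]
    mixed = reliability[:, None] * one_hot + (1 - reliability[:, None]) * probs
    # Trusted words keep their hard labels; only suspected mismatches soften.
    return np.where((reliability >= tau)[:, None], one_hot, mixed)
```

In this sketch, words whose reliability clears the threshold train against their original labels, while the rest train against a soft target that leans on the model's own prediction in proportion to how unreliable the label looks.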
The paper concludes that NIC effectively handles diverse noisy scenarios and can be easily integrated into any state-of-the-art image captioning method.