2024 | Zhongtian Fu*, Kefei Song*, Luping Zhou, Yang Yang
The paper "Noise-Aware Image Captioning with Progressively Exploring Mismatched Words" addresses the challenge of noisy image-text pairs in image captioning. The authors propose Noise-aware Image Captioning (NIC), a method that adaptively mitigates the impact of noisy data by progressively exploring and identifying mismatched words. NIC evaluates the reliability of each word-label from two aspects: inter-modal representativeness, which measures the relevance of the current word to the image, and intra-modal informativeness, which assesses the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels based on the reliability of the original word-labels and the degree of model convergence, which makes learning more robust to mismatched words. Extensive experiments on the MS-COCO and Conceptual Caption datasets demonstrate the effectiveness of NIC in various noisy scenarios, with significant improvements over existing methods. The method is flexible and can be integrated into any state-of-the-art image captioning model.
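
The pseudo-word-label idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the product used to combine the two reliability aspects, and the blending rule `trust = 1 - convergence * (1 - reliability)` are all assumptions chosen for illustration. The summary only states that pseudo-word-labels depend on the reliability of the original word-labels and on model convergence.

```python
import math


def softmax(logits):
    """Convert raw vocabulary scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_reliability(representativeness, informativeness):
    """Combine the two aspects into one score in [0, 1].

    ASSUMPTION: a simple product. The summary does not say how the
    inter-modal and intra-modal scores are combined.
    """
    return representativeness * informativeness


def pseudo_word_label(one_hot, predicted, reliability, convergence):
    """Blend the original word-label with the model's own prediction.

    ASSUMPTION: hypothetical blending rule. A reliable label, or an
    unconverged model (convergence near 0), keeps the original label;
    an unreliable label with a converged model leans on the prediction.
    """
    trust = 1.0 - convergence * (1.0 - reliability)
    return [trust * y + (1.0 - trust) * p for y, p in zip(one_hot, predicted)]


def cross_entropy(target, predicted):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, predicted))


# Toy vocabulary of 4 words; the annotated word is index 2.
one_hot = [0.0, 0.0, 1.0, 0.0]
predicted = softmax([2.0, 0.5, 0.1, -1.0])

# A word weakly related to the image, seen late in training.
reliability = word_reliability(representativeness=0.2, informativeness=0.5)
target = pseudo_word_label(one_hot, predicted, reliability, convergence=0.8)
loss = cross_entropy(target, predicted)
```

Under these assumptions, a mismatched word contributes a softer target than its one-hot label, so its loss pulls the model less strongly toward the noisy annotation, while well-matched words are trained almost exactly as in standard cross-entropy.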