Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval

2024 | Shenshen Li, Chen He, Xing Xu*, Fumin Shen, Yang Yang, Heng Tao Shen
The paper "Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval" addresses the challenge of text-based person retrieval, which aims to identify a specific pedestrian image from a gallery using textual descriptions. The primary issue is the heterogeneous modality gap due to significant intra-class variation and minimal inter-class variation. Existing methods often use vision-language pre-training or attention mechanisms to learn cross-modal alignments, but they suffer from matching ambiguity and one-sided cross-modal alignments. To tackle these issues, the authors propose a novel framework called Adaptive Uncertainty-based Learning (AUL). AUL consists of three key components: 1. **Uncertainty-aware Matching Filtration (UMF)**: This component uses Subjective Logic to model matching uncertainty and adaptively filter out unreliable matching pairs, selecting high-confidence cross-modal matches. 2. **Uncertainty-based Alignment Refinement (UAR)**: UAR explores one-to-many correspondence by constructing uncertainty representations and performs progressive learning to integrate coarse- and fine-grained alignments. 3. **Cross-modal Masked Modeling (CMM)**: CMM enhances the interaction between image and text by reconstructing signals through masked input, exploring deeper relations between the two modalities. The authors evaluate their method on three benchmark datasets (CUHK-PEDES, ICFG-PEDES, and RSTPReid) in supervised, weakly supervised, and domain generalization settings. Extensive experiments show that AUL consistently outperforms state-of-the-art methods, demonstrating its effectiveness in improving retrieval performance by addressing matching ambiguity and enhancing cross-modal alignment. The code for the AUL method is available at https://github.com/CFM-MSG/Code-AUL.The paper "Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval" addresses the challenge of text-based person retrieval, which aims to identify a specific pedestrian image from a gallery using textual descriptions. The primary issue is the heterogeneous modality gap due to significant intra-class variation and minimal inter-class variation. Existing methods often use vision-language pre-training or attention mechanisms to learn cross-modal alignments, but they suffer from matching ambiguity and one-sided cross-modal alignments. To tackle these issues, the authors propose a novel framework called Adaptive Uncertainty-based Learning (AUL). AUL consists of three key components: 1. **Uncertainty-aware Matching Filtration (UMF)**: This component uses Subjective Logic to model matching uncertainty and adaptively filter out unreliable matching pairs, selecting high-confidence cross-modal matches. 2. **Uncertainty-based Alignment Refinement (UAR)**: UAR explores one-to-many correspondence by constructing uncertainty representations and performs progressive learning to integrate coarse- and fine-grained alignments. 3. **Cross-modal Masked Modeling (CMM)**: CMM enhances the interaction between image and text by reconstructing signals through masked input, exploring deeper relations between the two modalities. The authors evaluate their method on three benchmark datasets (CUHK-PEDES, ICFG-PEDES, and RSTPReid) in supervised, weakly supervised, and domain generalization settings. Extensive experiments show that AUL consistently outperforms state-of-the-art methods, demonstrating its effectiveness in improving retrieval performance by addressing matching ambiguity and enhancing cross-modal alignment. 
The code for the AUL method is available at https://github.com/CFM-MSG/Code-AUL.