This paper proposes a novel method called Invariant Representation Learning for Text-Based Person Retrieval (IRLT) to improve the accuracy and reliability of text-based person retrieval (TPR). TPR aims to retrieve images of specific pedestrians based on textual queries. Existing methods primarily rely on pre-trained deep neural networks to learn cross-modal alignments between visual and textual data, but they often fail to account for causal relationships between text and images, leading to unreliable retrieval results under varying environmental conditions.
The proposed IRLT method takes a causal perspective, assuming that each image is composed of causal factors, which are semantically consistent with the text, and non-causal factors, which are irrelevant to retrieval. The goal is to extract causal factors that are robust to environmental changes and sufficient for accurate retrieval. To achieve this, IRLT introduces two key components: a style intervener and a scene simulator. The style intervener makes visual representations invariant to non-causal factors, while the scene simulator ensures that those representations remain sufficient for retrieval across different environments.
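As a rough illustration of how these two components could plug into a training pipeline, the PyTorch-style sketch below wraps a generic TPR backbone. Every name and interface in it (encode_image, encode_text, and the two modules) is a hypothetical placeholder, not the paper's actual API.

```python
# Hypothetical sketch of IRLT as a model-agnostic wrapper around an
# existing TPR model; interfaces are assumptions, not the authors' code.
import torch.nn as nn

class IRLT(nn.Module):
    def __init__(self, backbone: nn.Module,
                 style_intervener: nn.Module,
                 scene_simulator: nn.Module):
        super().__init__()
        self.backbone = backbone            # any existing TPR model
        self.style_intervener = style_intervener
        self.scene_simulator = scene_simulator

    def forward(self, images, texts):
        # Scene simulator: re-place the same pedestrians in similar and
        # dissimilar environments to probe sufficiency of causal factors.
        sim_imgs, dis_imgs = self.scene_simulator(images)

        # Style intervener: perturb visual features so the representation
        # cannot depend on non-causal style statistics (in practice this
        # would typically act on intermediate feature maps).
        v = self.style_intervener(self.backbone.encode_image(images))
        t = self.backbone.encode_text(texts)
        v_sim = self.backbone.encode_image(sim_imgs)
        v_dis = self.backbone.encode_image(dis_imgs)

        # Downstream losses would align t with each visual view
        # (cross-modal matching plus invariance/sufficiency constraints).
        return v, t, v_sim, v_dis
```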
The style intervener simulates variations in non-causal factors by modeling feature uncertainty, forcing the model to learn representations that are independent of those factors. The scene simulator places pedestrian images in similar and dissimilar environments so that the learned causal factors remain sufficient for accurate retrieval. IRLT is model-agnostic and can be integrated with existing TPR methods.
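A common way to model feature uncertainty of this kind is to treat the channel-wise mean and standard deviation of intermediate visual features as Gaussian random variables and resample them during training, so style statistics vary while content is preserved. The sketch below illustrates that general technique; it is one plausible instantiation under that assumption, not necessarily the paper's exact style intervener.

```python
import torch
import torch.nn as nn

class FeatureStatsPerturbation(nn.Module):
    """Perturb channel-wise feature statistics (mean/std) with Gaussian
    noise whose scale is estimated across the batch. This simulates
    shifts in non-causal style factors (e.g., lighting or color) while
    leaving content unchanged. A generic sketch, not the paper's exact
    style intervener."""

    def __init__(self, eps: float = 1e-6, p: float = 0.5):
        super().__init__()
        self.eps = eps
        self.p = p  # probability of applying the perturbation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a visual encoder stage.
        if not self.training or torch.rand(1).item() > self.p:
            return x
        mu = x.mean(dim=(2, 3), keepdim=True)                   # (B, C, 1, 1)
        sig = (x.var(dim=(2, 3), keepdim=True) + self.eps).sqrt()

        # Uncertainty of the statistics themselves, estimated over the batch.
        mu_scale = (mu.var(dim=0, keepdim=True) + self.eps).sqrt()
        sig_scale = (sig.var(dim=0, keepdim=True) + self.eps).sqrt()

        new_mu = mu + torch.randn_like(mu) * mu_scale
        new_sig = sig + torch.randn_like(sig) * sig_scale

        # Re-normalize and re-style: content kept, style randomized.
        return new_sig * (x - mu) / sig + new_mu
```

Because only first- and second-order feature statistics are perturbed, the spatial content of the feature map, which carries the causal, text-relevant information, is left intact.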
Extensive experiments on three benchmark datasets demonstrate that IRLT outperforms existing methods in terms of accuracy and generalization. The method is robust to various environmental changes and achieves significant improvements in retrieval performance. The results show that IRLT effectively captures causally invariant visual-linguistic correlations, leading to more accurate and reliable text-based person retrieval.