Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID


2024-07-01 | Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao
This paper addresses text-to-image person re-identification (ReID) by leveraging Multi-modal Large Language Models (MLLMs). The authors generate diverse and accurate textual descriptions for pedestrian images and use them to train a transferable ReID model. The key contributions are:

1. **Template-based Diversity Enhancement (TDE)**: enhances the diversity of textual descriptions by using a dynamic instruction that varies according to templates generated through multi-turn dialogues with a Large Language Model (LLM). This reduces overfitting to specific sentence patterns and improves the model's generalization (prompt-sampling sketch below).
2. **Noise-aware Masking (NAM)**: identifies noisy words in the textual descriptions and masks them to reduce their impact on training. It estimates each word's noise level from the similarity between text tokens and image tokens and assigns masking probabilities accordingly (masking sketch below).
3. **Dataset Construction**: uses the LUPerson dataset as the image source and generates textual descriptions with MLLMs. The resulting dataset, LUPerson-MLLM, contains 1.0 million images with four captions per image, including both static and dynamic descriptions.
4. **Model Training and Evaluation**: trains a CLIP-ViT/B-16 backbone with the similarity distribution matching (SDM) loss (loss sketch below). Experiments on three benchmarks (CUHK-PEDES, ICFG-PEDES, and RSTPReid) show significant improvements in both direct transfer and fine-tuning settings.
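The summary does not reproduce the authors' actual prompts, but the idea behind TDE can be pictured as sampling one instruction template per image from a pool produced offline by LLM dialogue. The template strings and function names below are hypothetical, illustrative placeholders only.

```python
import random

# Hypothetical template pool, assumed to come from multi-turn dialogues with an LLM
# (the actual templates used to build LUPerson-MLLM are not reproduced here).
TEMPLATES = [
    "Describe the pedestrian's clothing from head to toe in one paragraph.",
    "Write a caption focusing on the person's appearance, accessories, and pose.",
    "Summarize what the person in the image is wearing and carrying.",
]

def build_dynamic_instruction(base_instruction: str = "You are an image captioner.") -> str:
    """Sample a template at random so each image is captioned with a different
    sentence pattern, which is the diversity-enhancing idea behind TDE."""
    template = random.choice(TEMPLATES)
    return f"{base_instruction} {template}"

# Usage: one dynamic instruction per image, passed to the MLLM alongside the image.
for image_path in ["person_001.jpg", "person_002.jpg"]:
    instruction = build_dynamic_instruction()
    # caption = mllm.generate(image_path, instruction)  # MLLM call is a placeholder
```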
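The noise-estimation step of NAM can similarly be pictured as computing, for each text token, its best similarity to the image tokens and turning low similarity into a high masking probability. The mapping used below (one minus a min-max-normalized similarity) is an assumption for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def noise_aware_mask_probs(text_tokens: torch.Tensor,
                           image_tokens: torch.Tensor) -> torch.Tensor:
    """Estimate a per-word masking probability from text-image token similarity.

    text_tokens:  (num_words, dim)   token embeddings of the caption
    image_tokens: (num_patches, dim) patch embeddings of the paired image
    Returns a (num_words,) tensor: words poorly supported by the image
    (likely noise) receive higher masking probability.
    """
    text = F.normalize(text_tokens, dim=-1)
    image = F.normalize(image_tokens, dim=-1)
    sim = text @ image.T                      # (num_words, num_patches) cosine similarities
    support = sim.max(dim=-1).values          # best-matching image patch per word
    # Assumed mapping: rescale support to [0, 1] within the caption, then invert,
    # so the least-supported words are masked most often.
    support = (support - support.min()) / (support.max() - support.min() + 1e-6)
    return 1.0 - support

# Usage sketch: sample a Bernoulli mask and replace masked words with a [MASK] id.
# mask = torch.bernoulli(noise_aware_mask_probs(text_feats, img_feats)).bool()
```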
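For the training objective, the similarity distribution matching (SDM) loss matches the softmax-normalized image-text similarity distribution within a batch to the distribution of true identity matches. The sketch below follows the commonly used KL-divergence formulation with a temperature, assumed here rather than taken from the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sdm_loss(image_feats: torch.Tensor,
             text_feats: torch.Tensor,
             pids: torch.Tensor,
             temperature: float = 0.02,
             eps: float = 1e-8) -> torch.Tensor:
    """Similarity distribution matching over a batch of paired embeddings.

    image_feats, text_feats: (batch, dim) embeddings from the two encoders
    pids: (batch,) person identity labels; pairs sharing an identity are matches.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = image_feats @ text_feats.T / temperature            # (batch, batch)

    # Ground-truth matching distribution: uniform over same-identity pairs per row.
    labels = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()
    q = labels / labels.sum(dim=1, keepdim=True)

    # KL divergence between predicted and true distributions, in both directions.
    p_i2t = F.softmax(sim, dim=1)
    p_t2i = F.softmax(sim.T, dim=1)
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()
    return loss_i2t + loss_t2i
```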
The paper also reviews related work in text-to-image ReID and multi-modal large language models, and provides a detailed experimental setup and results. The authors conclude by highlighting the limitations of their methods and future directions, emphasizing the need for more robust ways to handle diversity and noise in MLLM-generated descriptions.