Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

1 Jul 2024 | Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao
This paper proposes a method to enhance the transferability of text-to-image person re-identification (ReID) by leveraging multi-modal large language models (MLLMs). Two key challenges are addressed: generating diverse textual descriptions and reducing the impact of noise in those descriptions.

To generate diverse descriptions, the authors propose Template-based Diversity Enhancement (TDE), which uses description templates obtained through multi-turn dialogues with a large language model (LLM) so that each pedestrian image yields varied captions. To reduce noise, they introduce Noise-aware Masking (NAM), which identifies words in a description that are poorly supported by the image and masks them during training, improving the model's ability to align visual and textual features.

The authors use the LUPerson dataset as the image source, generate textual descriptions with MLLMs, and train a model on the resulting image-text pairs; performance is then evaluated on existing text-to-image ReID benchmarks. TDE substantially increases the diversity of the generated descriptions, while NAM reduces the impact of noisy descriptions by masking potentially incorrect words during training.

Experiments show that the proposed methods significantly improve text-to-image ReID performance in both direct-transfer and traditional evaluation settings, achieving state-of-the-art results. These findings indicate that TDE and NAM effectively address the challenges of generating diverse, low-noise textual descriptions, leading to improved transferability in real-world applications.
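The summary does not reproduce the paper's implementation of TDE, but the template-filling idea it describes can be sketched in a few lines. The template strings, attribute names, and the `diversify` helper below are all hypothetical illustrations, standing in for templates that the paper obtains through multi-turn LLM dialogues:

```python
import random

# Hypothetical templates; the paper derives its templates from
# multi-turn dialogues with an LLM rather than hard-coding them.
TEMPLATES = [
    "A {gender} wearing {top} and {bottom}, carrying {accessory}.",
    "The {gender} is dressed in {top} and {bottom}, with {accessory} in hand.",
    "With {accessory} in hand, a {gender} in {top} and {bottom} walks by.",
]

def diversify(attributes, templates=TEMPLATES, seed=None):
    """Fill every template with the same attribute set, so one
    pedestrian image yields several distinct captions."""
    rng = random.Random(seed)
    descriptions = [t.format(**attributes) for t in templates]
    rng.shuffle(descriptions)  # vary caption order across epochs
    return descriptions

attrs = {"gender": "woman", "top": "a red jacket",
         "bottom": "blue jeans", "accessory": "a backpack"}
for caption in diversify(attrs, seed=0):
    print(caption)
```

The point of the sketch is that diversity comes from varying sentence structure while holding the visual attributes fixed, which is what allows the trained model to generalize across writing styles.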
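The summary also leaves NAM's mechanism abstract. A minimal sketch, assuming noisy words are detected via low per-token image-text similarity (the scoring scheme here is an assumption, not the paper's exact criterion):

```python
import numpy as np

def noise_aware_mask(tokens, similarities, mask_ratio=0.25, mask_token="[MASK]"):
    """Replace the tokens least supported by the image with a mask token.

    `similarities` holds a per-token image-text alignment score
    (assumed precomputed, e.g. from cross-modal attention); the bottom
    `mask_ratio` fraction of tokens by score is masked during training.
    """
    sims = np.asarray(similarities, dtype=float)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    noisy_idx = set(np.argsort(sims)[:n_mask].tolist())
    return [mask_token if i in noisy_idx else tok
            for i, tok in enumerate(tokens)]

tokens = ["a", "woman", "in", "a", "green", "coat"]
sims   = [0.90, 0.80, 0.85, 0.90, 0.10, 0.70]  # "green" poorly matches the image
print(noise_aware_mask(tokens, sims))
# → ['a', 'woman', 'in', 'a', '[MASK]', 'coat']
```

Masking rather than deleting the suspect word keeps the sentence length and structure intact, so the text encoder still sees a well-formed caption while the unreliable attribute no longer drives the alignment loss.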