15 Mar 2024 | Pingping Zhang, Yuhao Wang, Yang Liu, Zhengzheng Tu, Huchuan Lu
This paper proposes EDITOR, a novel feature learning framework for multi-modal object re-identification (ReID) that selects diverse tokens from vision Transformers. The key components of EDITOR are a shared vision Transformer for feature extraction, a Spatial-Frequency Token Selection (SFTS) module that adaptively selects object-centric tokens, and a Hierarchical Masked Aggregation (HMA) module that facilitates feature interactions across modalities. The framework also introduces two new loss functions, Background Consistency Constraint (BCC) and Object-Centric Feature Refinement (OCFR), to suppress background effects and improve feature discrimination. EDITOR is evaluated on three multi-modal ReID benchmarks, RGBNT201, RGBNT100, and MSVR310, where it outperforms existing methods in mAP and rank metrics, indicating its potential for practical applications in complex visual scenarios. Its ability to select diverse tokens and suppress background effects makes it a promising approach for multi-modal object ReID.
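To make the described pipeline concrete, below is a minimal, hypothetical sketch of the overall flow: a shared backbone produces patch tokens per modality, a token-selection step keeps object-centric tokens, and an aggregation step fuses the selected tokens into one identity feature. The summary does not specify how SFTS scores tokens or how HMA masks and aggregates, so the module internals, class names, and dimensions below are illustrative placeholders rather than the authors' implementation.

```python
# Hypothetical sketch of the EDITOR-style pipeline described above.
# All module internals are placeholders; only the overall flow
# (shared tokens -> token selection -> cross-modal aggregation) follows the summary.
import torch
import torch.nn as nn


class TopKTokenSelection(nn.Module):
    """Stand-in for SFTS: keep the k highest-scoring patch tokens."""

    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token relevance score (placeholder criterion)
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        scores = self.score(tokens).squeeze(-1)               # (batch, num_patches)
        idx = scores.topk(self.keep, dim=1).indices           # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                          # (batch, keep, dim)


class SimpleAggregation(nn.Module):
    """Stand-in for HMA: fuse selected tokens from all modalities."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_per_modality: list) -> torch.Tensor:
        fused = torch.cat(tokens_per_modality, dim=1)         # (batch, total_tokens, dim)
        fused = self.norm(fused + self.attn(fused, fused, fused)[0])
        return fused.mean(dim=1)                              # (batch, dim) identity feature


# Usage with dummy RGB / NIR / TIR patch tokens; the shared ViT backbone is
# omitted, and 196 patches of dimension 768 are assumed for illustration.
batch, patches, dim = 2, 196, 768
select = TopKTokenSelection(dim, keep=64)
aggregate = SimpleAggregation(dim)
modalities = [torch.randn(batch, patches, dim) for _ in range(3)]
feature = aggregate([select(m) for m in modalities])
print(feature.shape)  # torch.Size([2, 768])
```

In this sketch the fused feature would feed a ReID head trained with the identity losses plus the BCC and OCFR objectives mentioned above; those losses are not reproduced here since their formulations are not given in the summary.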