SDSTrack is a novel symmetric multimodal tracking framework designed to enhance the robustness and accuracy of visual object tracking (VOT) in complex environments. The framework addresses the limitations of traditional RGB-based trackers by introducing lightweight adaptation techniques and a complementary masked patch distillation strategy. Key contributions include:
1. **Symmetric Multimodal Adaptation (SMA)**: This technique efficiently transfers the feature extraction capability of the RGB branch to other modalities (e.g., depth, thermal, event) through lightweight adapters, ensuring balanced, symmetric feature fusion (a minimal adapter sketch appears after this summary).
2. **Complementary Masked Patch Distillation**: This strategy improves robustness by randomly masking patches in one modality and distilling knowledge from the complete inputs to the masked ones, strengthening the model's ability to handle extreme conditions (see the masking sketch after this summary).
3. **Parameter-Efficient Fine-Tuning**: The framework employs parameter-efficient fine-tuning (PEFT) to keep the number of trainable parameters small, making it well suited to the limited amount of multimodal training data (see the fine-tuning sketch after this summary).
4. **Performance on Multiple Datasets**: Extensive experiments on DepthTrack, VOT-RGBD2022, RGBT234, and VisEvent show that SDSTrack outperforms prior state-of-the-art methods, setting new records in precision, recall, and F-score.
5. **Robustness in Extreme Conditions**: The method shows superior robustness in challenging scenarios, such as missing or occluded modalities, with significant improvements in precision and success rates.
6. **Inference Speed**: SDSTrack achieves real-time tracking (20.86 fps) while maintaining high accuracy and robustness.
The paper also provides implementation details, ablation studies, and visualization results that support the effectiveness of the proposed methods.
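
To make the adapter-based transfer in contribution 1 concrete, here is a minimal PyTorch sketch of a symmetric, adapter-mediated fusion step. The module names (`ModalityAdapter`, `SymmetricFusionBlock`), the bottleneck dimension, and the residual exchange between branches are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight residual bottleneck adapter (illustrative design, not the paper's exact module)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual projection: the frozen backbone features pass through unchanged,
        # the adapter only adds a small learned correction.
        return x + self.up(self.act(self.down(x)))

class SymmetricFusionBlock(nn.Module):
    """Treats RGB and the X modality symmetrically: each branch has its own adapter,
    and adapted information flows in both directions."""
    def __init__(self, dim: int):
        super().__init__()
        self.rgb_adapter = ModalityAdapter(dim)
        self.x_adapter = ModalityAdapter(dim)

    def forward(self, rgb_tokens: torch.Tensor, x_tokens: torch.Tensor):
        rgb_adapted = self.rgb_adapter(rgb_tokens)
        x_adapted = self.x_adapter(x_tokens)
        # Symmetric exchange: each modality is refined by the other's adapted features,
        # so neither branch is treated as dominant.
        return rgb_tokens + x_adapted, x_tokens + rgb_adapted

# Example usage with hypothetical ViT-like token shapes (batch 2, 196 patches, dim 768).
rgb = torch.randn(2, 196, 768)
depth = torch.randn(2, 196, 768)
fused_rgb, fused_depth = SymmetricFusionBlock(dim=768)(rgb, depth)
```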
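
The complementary masked patch distillation in contribution 2 could be sketched as follows, assuming that masked patch tokens in one modality are filled from the other modality and that the masked branch is pulled toward the clean branch with an MSE loss. The function names, mask ratio, and loss choice are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def complementary_mask(rgb_tokens: torch.Tensor, x_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly mask patch tokens in each modality and fill the holes with the
    corresponding patches from the other modality. Shapes: (B, N, D)."""
    B, N, _ = rgb_tokens.shape
    mask = torch.rand(B, N, 1, device=rgb_tokens.device) < mask_ratio
    rgb_masked = torch.where(mask, x_tokens, rgb_tokens)   # masked RGB patches replaced by X patches
    x_masked = torch.where(mask, rgb_tokens, x_tokens)     # and vice versa
    return rgb_masked, x_masked

def distillation_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Distill the clean (teacher) branch into the masked (student) branch;
    the teacher is detached so gradients only flow through the student."""
    return F.mse_loss(student_feats, teacher_feats.detach())
```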
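
For contribution 3, parameter-efficient fine-tuning is commonly implemented by freezing the pre-trained backbone and training only the lightweight adapter weights. The sketch below assumes adapter parameters can be identified by name; the naming convention and helper are hypothetical placeholders, not the paper's training script.

```python
import torch.nn as nn

def configure_peft(model: nn.Module) -> None:
    """Freeze all weights, then re-enable gradients only for adapter parameters
    (identified here by a hypothetical 'adapter' substring in the parameter name)."""
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if "adapter" in name:
            param.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable} / {total}")
```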