Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

26 Jul 2024 | Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling
This paper proposes LoRAT, a method that leverages LoRA (Low-Rank Adaptation) to improve visual tracking models. LoRA is a parameter-efficient fine-tuning technique that adapts pre-trained models without significant computational overhead. The key challenges in applying LoRA to visual tracking lie in the design of positional embeddings and in the inductive biases of convolutional heads, both of which can hinder performance. To address these issues, the authors propose two design changes: decoupling positional embeddings into shared and independent components, and replacing the convolutional head with an MLP-based anchor-free head.

The resulting tracker, LoRAT, is evaluated on several large-scale benchmarks, including LaSOT, LaSOT_ext, TrackingNet, GOT-10k, and TNL2K, and achieves state-of-the-art performance with significantly reduced training time and memory usage. For example, the LoRAT-B-224 variant reaches a SUC score of 0.717 on LaSOT while running at 209 FPS, and the LoRAT-g-378 variant, which uses the largest ViT backbone, reaches a SUC score of 0.762, outperforming previous state-of-the-art models.

The paper also evaluates efficiency, showing that LoRAT can be trained on a single NVIDIA RTX 4090 GPU within 11 hours while delivering 605 FPS at inference. These results demonstrate that LoRAT is both effective and efficient, making it a practical solution for training advanced tracking models with manageable resources. The study bridges the gap between advanced large-scale models and resource accessibility, marking a step toward sustainable progress in visual tracking.
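To make the core idea concrete, below is a minimal PyTorch sketch of how LoRA adapts a frozen pre-trained linear layer with a trainable low-rank update. This is an illustrative example of the general LoRA technique, not the paper's actual implementation; the class name LoRALinear and the rank/alpha defaults are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a trainable low-rank update.

    The adapted forward pass computes W x + (alpha / r) * B A x, where only the
    rank-r factors A and B receive gradients. Names and defaults here are
    illustrative, not taken from the LoRAT codebase.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pre-trained weights
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        # Low-rank factors: B A has shape (out_f, in_f) but only
        # rank * (in_f + out_f) trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Example: adapt a ViT-style qkv projection (hypothetical shapes).
qkv = nn.Linear(768, 3 * 768)
adapted = LoRALinear(qkv, rank=8)
x = torch.randn(2, 196, 768)
print(adapted(x).shape)  # torch.Size([2, 196, 2304])
```

Because the low-rank update B A can be merged back into the frozen weight matrix after fine-tuning, this style of adaptation adds no extra cost at inference time, which is consistent with the high inference speeds reported for LoRAT.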