Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance


26 Jul 2024 | Liting Lin¹, Heng Fan², Zhipeng Zhang³, Yaowei Wang¹†, Yong Xu⁴˒¹, and Haibin Ling⁵†
The paper introduces LoRAT, a method that brings Low-Rank Adaptation (LoRA) to visual tracking to achieve faster training, larger models, and stronger performance. LoRAT uses LoRA, a parameter-efficient fine-tuning technique, to adapt large pre-trained Vision Transformers (ViTs) to the tracking task. The key contributions are:

1. **Decoupled Positional Embeddings**: LoRAT decouples the positional embeddings of transformer-based trackers into shared spatial embeddings and independent token-type embeddings, allowing better adaptation of pre-trained ViT models.
2. **Anchor-Free Head Network**: A multilayer perceptron (MLP)-based anchor-free head replaces the conventional convolutional head, removing its inductive biases and improving performance with less computational overhead.
3. **Efficient Training and Inference**: LoRAT substantially improves training efficiency and inference speed, making it practical to train large-scale trackers with limited resources.
4. **State-of-the-Art Performance**: LoRAT sets new records on multiple benchmarks, including LaSOT, LaSOT_ext, TrackingNet, GOT-10k, and TNL2K, with competitive or superior performance.

The paper also provides detailed experimental results and ablation studies validating the effectiveness of LoRAT, demonstrating that it handles large-scale visual tracking efficiently and effectively. Hedged sketches of the three main components are given below.
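To make the parameter-efficient fine-tuning concrete, here is a minimal sketch of a standard LoRA-augmented linear layer in PyTorch. This is not the paper's code; it illustrates the general LoRA technique the method builds on (frozen pre-trained weight `W`, trainable low-rank factors `A`, `B`, update scaled by `alpha / r`):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA: y = W x + (alpha / r) * B A x.
    The pre-trained weight W is frozen; only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained layer
        # A: Gaussian init, B: zeros, so the update starts at zero
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

Wrapping a ViT's attention projections in `LoRALinear` leaves the backbone frozen while training only the small rank-`r` factors, which is what makes fine-tuning large ViTs cheap.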
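The decoupled positional embedding (contribution 1) can be sketched as follows. All names here are hypothetical, assuming template and search-region tokens share one set of spatial position embeddings and are distinguished only by learned token-type embeddings:

```python
import torch
import torch.nn as nn

class DecoupledPosEmbed(nn.Module):
    """Hypothetical sketch: one shared spatial embedding for both inputs,
    plus a per-input token-type embedding (template vs. search region)."""
    def __init__(self, dim: int, max_tokens: int):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(1, max_tokens, dim))  # shared spatial positions
        self.type_template = nn.Parameter(torch.zeros(1, 1, dim))     # marks template tokens
        self.type_search = nn.Parameter(torch.zeros(1, 1, dim))       # marks search-region tokens

    def forward(self, template_tokens: torch.Tensor,
                search_tokens: torch.Tensor) -> torch.Tensor:
        z = template_tokens + self.spatial[:, : template_tokens.size(1)] + self.type_template
        x = search_tokens + self.spatial[:, : search_tokens.size(1)] + self.type_search
        return torch.cat([z, x], dim=1)  # joint sequence for the LoRA-adapted ViT
```

Because the spatial part can be taken from the pre-trained ViT's own position embeddings, this layout stays close to what the backbone saw during pre-training, which is the stated motivation for the decoupling.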
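For the MLP-based anchor-free head (contribution 2), a plausible minimal sketch is below. It is an assumption-laden illustration, not the paper's implementation: each search-region token independently predicts a foreground score and a normalized box, with no convolutions and no anchor boxes:

```python
import torch
import torch.nn as nn

def mlp(dim: int, hidden: int, out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, out))

class AnchorFreeMLPHead(nn.Module):
    """Hypothetical anchor-free head: per-token classification and
    box regression via small MLPs instead of a convolutional head."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.cls_head = mlp(dim, hidden, 1)  # per-token foreground score
        self.box_head = mlp(dim, hidden, 4)  # per-token normalized box

    def forward(self, search_tokens: torch.Tensor):
        # search_tokens: (B, N, dim) output tokens of the search region
        scores = self.cls_head(search_tokens).squeeze(-1)  # (B, N)
        boxes = self.box_head(search_tokens).sigmoid()     # (B, N, 4) in [0, 1]
        return scores, boxes
```

At inference, the token with the highest score supplies the predicted box, which keeps the head purely token-wise and cheap relative to a stacked convolutional head.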