Visual Prompt Tuning (VPT) is a parameter-efficient method for adapting large pre-trained vision Transformers to downstream tasks. Unlike full fine-tuning, which updates all model parameters, VPT introduces only a small number of learnable parameters in the input space while keeping the backbone frozen, which sharply reduces per-task storage costs. The method inserts learnable prompt tokens into the input space of the Transformer layers and comes in two variants: VPT-SHALLOW, which prepends prompts only to the first layer's input, and VPT-DEEP, which introduces a separate set of prompts at the input of every layer.

Across a wide range of downstream tasks, VPT outperforms other parameter-efficient tuning protocols and, in many cases, full fine-tuning itself, while using significantly fewer trainable parameters. This advantage holds as model size and pre-training data scale vary, and it is especially pronounced in low-data regimes. VPT remains competitive across architectures, including ViT-Base, ViT-Large, and ViT-Huge as well as Swin Transformers, and it also extends to other settings such as semantic segmentation and ConvNet backbones. Its combination of efficiency and accuracy makes it a promising approach for adapting large vision models.
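To make the two variants concrete, below is a minimal PyTorch-style sketch of how prompt tokens could be prepended to a frozen Transformer encoder. The `PromptedTransformer` class, the use of `nn.TransformerEncoderLayer`, and the dimensions and initialization are illustrative assumptions, not the authors' implementation; the full method also trains a task-specific classification head and operates on ViT patch and class tokens.

```python
import torch
import torch.nn as nn


class PromptedTransformer(nn.Module):
    """Minimal sketch of VPT-style prompt insertion around a frozen
    Transformer encoder (illustrative, not the reference implementation)."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_prompts=10, deep=True):
        super().__init__()
        # Frozen backbone: a stack of standard Transformer encoder layers.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(depth)
        ])
        for p in self.layers.parameters():
            p.requires_grad = False  # backbone stays frozen

        self.deep = deep
        self.num_prompts = num_prompts
        # VPT-SHALLOW: one prompt set for the first layer only.
        # VPT-DEEP: an independent prompt set for every layer.
        num_sets = depth if deep else 1
        self.prompts = nn.Parameter(
            torch.empty(num_sets, num_prompts, embed_dim).uniform_(-0.1, 0.1)
        )

    def forward(self, tokens):
        # tokens: (batch, seq_len, embed_dim), e.g. patch embeddings from a frozen ViT stem.
        x = tokens
        for i, layer in enumerate(self.layers):
            if i == 0:
                # Prepend the first set of prompts to the input sequence.
                x = torch.cat([self.prompts[0].expand(x.size(0), -1, -1), x], dim=1)
            elif self.deep:
                # VPT-DEEP: discard the previous layer's prompt outputs and
                # prepend this layer's own learnable prompts instead.
                x = torch.cat(
                    [self.prompts[i].expand(x.size(0), -1, -1),
                     x[:, self.num_prompts:, :]],
                    dim=1,
                )
            x = layer(x)
        return x


# Example usage: 2 images, 196 patch tokens each, 10 prompts per layer.
model = PromptedTransformer(deep=True)
out = model(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 206, 768]), 10 prompt tokens prepended
```

The sketch mirrors the key design choice separating the two variants: in VPT-DEEP each layer drops the previous layer's prompt outputs and prepends its own fresh prompts, while in VPT-SHALLOW the prompts inserted at the first layer simply propagate through the rest of the stack. Only the prompt parameters (and, in the full method, a lightweight task head) receive gradients, so the per-task storage is a small fraction of the backbone size.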