Rotary Position Embedding for Vision Transformer

16 Jul 2024 | Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun
Rotary Position Embedding (RoPE) has shown strong performance in language models, particularly for length extrapolation. This study explores RoPE's effectiveness in Vision Transformers (ViTs) for computer vision tasks. The analysis reveals that RoPE maintains precision while increasing image resolution, leading to performance improvements in ImageNet-1k classification, COCO detection, and ADE-20k segmentation. The paper proposes RoPE-Mixed, a 2D RoPE variant using mixed axis frequencies, which outperforms traditional axial RoPE and other position embeddings. Experiments on ViT and Swin Transformer architectures show that RoPE-Mixed significantly improves multi-resolution classification, object detection, and semantic segmentation.

The study demonstrates that RoPE-Mixed is effective for vision tasks, offering a promising solution for position embedding in ViTs with minimal computational overhead. The results indicate that RoPE-Mixed provides better performance than conventional position embeddings, particularly in extrapolation scenarios. The paper concludes that RoPE-Mixed is a valuable addition to vision transformer research, offering improved performance and flexibility for various tasks.
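The summary describes RoPE-Mixed as a 2D RoPE variant with mixed axis frequencies: rather than rotating half the feature pairs by the x coordinate and half by the y coordinate (axial RoPE), each rotation angle depends on both axes. A minimal NumPy sketch of this idea follows; the function names are illustrative and the frequencies here are random, whereas in the paper the per-axis frequencies are learned parameters:

```python
import numpy as np

def rope_mixed_angles(coords, freqs_x, freqs_y):
    """Rotation angles for RoPE-Mixed: every frequency pair mixes both
    axes, theta = x * f_x + y * f_y.  (Axial RoPE would instead use
    theta = x * f for half the pairs and theta = y * f for the rest.)

    coords: (N, 2) patch positions; freqs_x, freqs_y: (D/2,) frequencies.
    Returns (N, D/2) angles.
    """
    x, y = coords[:, 0:1], coords[:, 1:2]               # (N, 1) each
    return x * freqs_x[None, :] + y * freqs_y[None, :]  # (N, D/2)

def apply_rope(q, angles):
    """Rotate consecutive feature pairs of q (shape (N, D)) by angles."""
    q1, q2 = q[..., 0::2], q[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin
    out[..., 1::2] = q1 * sin + q2 * cos
    return out
```

Because the angles are linear in the patch coordinates, the attention score between a rotated query and key depends only on their relative 2D offset, which is the property that lets RoPE extrapolate to resolutions (and hence position ranges) unseen during training.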