16 Jul 2024 | Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun
This paper explores the application of Rotary Position Embedding (RoPE), a technique originally designed for language models, to Vision Transformers (ViTs). RoPE is known for its effectiveness in language models, particularly in handling length extrapolation. The study investigates whether RoPE can similarly enhance ViTs, especially in multi-resolution tasks such as image classification, object detection, and semantic segmentation.
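To ground the discussion, here is a minimal NumPy sketch of the original 1D RoPE used in language models (function name and shapes are our own, not from the paper): each consecutive feature pair of a query or key vector is rotated by an angle proportional to the token position, so that the query-key dot product depends only on the relative position.

```python
import numpy as np

def rope_1d(x, positions, theta=10000.0):
    """Apply 1D rotary position embedding to query/key vectors.

    x: (seq_len, dim) with even dim; positions: (seq_len,) token indices.
    Feature pair (2t, 2t+1) is rotated by angle position * theta^(-2t/dim).
    """
    seq_len, dim = x.shape
    freqs = theta ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotations compose, the attention score between a query at position m and a key at position n depends only on m - n, which is what makes RoPE robust to sequence lengths unseen during training.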
The authors propose a 2D implementation of RoPE, named RoPE-Mixed, which uses mixed axis frequencies so that rotations can encode diagonal directions as well as axial ones. This addresses a limitation of axial frequencies, which tie each frequency to a single axis and therefore cannot represent diagonal relative positions, potentially degrading performance in vision tasks.
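The mixed-frequency idea can be sketched as follows (a simplified illustration with our own function names; in the paper the per-pair frequencies are learnable parameters, treated here as fixed inputs). Each feature pair gets a frequency for both axes, so the rotation angle is a linear combination of the x and y patch coordinates, and the resulting attention score depends only on the 2D relative position:

```python
import numpy as np

def rope_mixed(x, px, py, fx, fy):
    """Apply a RoPE-Mixed-style rotation to patch features.

    x: (num_patches, dim) with even dim; px, py: (num_patches,) patch
    coordinates; fx, fy: (dim/2,) per-pair frequencies for each axis.
    Angle for pair t is px * fx[t] + py * fy[t], so a pair with both
    fx[t] != 0 and fy[t] != 0 rotates along a diagonal direction,
    which purely axial frequencies cannot express.
    """
    angles = px[:, None] * fx[None, :] + py[:, None] * fy[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

RoPE-Axial falls out as the special case where half of the pairs have fy = 0 and the other half have fx = 0; learning (fx, fy) jointly is what lets RoPE-Mixed cover diagonal directions.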
Experiments are conducted on two representative architectures, ViT and Swin Transformer, across various datasets and tasks. The results show that 2D RoPE, particularly RoPE-Mixed, significantly improves performance in multi-resolution classification, object detection, and semantic segmentation. RoPE-Mixed outperforms conventional position embeddings, namely absolute positional embedding (APE) and relative position bias (RPB), as well as the axial 2D RoPE variant (RoPE-Axial).
The study concludes that 2D RoPE, especially RoPE-Mixed, is a promising solution for improving the performance of ViTs in multi-resolution tasks with minimal computational overhead. The authors provide code and pre-trained models to facilitate further research and application.