Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

7 Mar 2024 | Yuchen Duan*2,1, Weiyun Wang*3,1, Zhe Chen*4,1, Xizhou Zhu5,1,6, Lewei Lu6, Tong Lu4, Yu Qiao1, Hongsheng Li2, Jifeng Dai5,1, and Wenhai Wang2,1✉
The paper introduces Vision-RWKV (VRWKV), an adaptation of the RWKV architecture from natural language processing to visual perception tasks. VRWKV is designed to handle sparse inputs efficiently and to provide strong global modeling while scaling up reliably.

**Key contributions**:
1. **Efficiency and Scalability**: VRWKV reduces the complexity of spatial aggregation, making it well suited to high-resolution images without windowing operations.
2. **Linear Attention Mechanism**: The model combines bidirectional global attention with a quad-directional shift (Q-Shift) operation, achieving global attention at linear complexity (illustrative sketches of both appear after this summary).
3. **Training Stability**: VRWKV employs relative positional bias, layer scale, and extra layer normalization to keep training stable as the model scales.

**Experiments**:
- **Image Classification**: VRWKV surpasses ViT in top-1 accuracy on ImageNet-1K and ImageNet-22K at lower computational cost.
- **Object Detection**: VRWKV outperforms ViT on the COCO dataset with fewer FLOPs.
- **Semantic Segmentation**: With UperNet on ADE20K, VRWKV consistently beats ViT with global attention while being more efficient.

**Ablation Study**:
- **Token Shift**: Q-Shift significantly enhances the receptive field and global attention.
- **Bidirectional Attention**: Global (bidirectional) attention improves performance over unidirectional attention.
- **Effective Receptive Field (ERF)**: Q-Shift expands the core range of the receptive field.
- **Efficiency Analysis**: VRWKV shows better inference speed and memory efficiency than ViT, especially at high resolutions.

**Conclusion**: VRWKV is proposed as an efficient and scalable alternative to ViT, showcasing the potential of linear-complexity transformers in vision tasks.
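To make the Q-Shift operation concrete, below is a minimal sketch of how a quad-directional token shift can be implemented. It assumes patch tokens laid out on a square grid and channels split evenly into four groups, one per direction; the actual VRWKV implementation additionally blends the shifted tokens with the originals via learnable mixing coefficients, which is omitted here.

```python
import torch

def q_shift(x, shift_pixel=1):
    """Sketch of a quad-directional shift (Q-Shift).

    x: (B, N, C) patch tokens arranged on an H x W grid (N = H * W).
    Each quarter of the channels is shifted by one patch in one of the four
    spatial directions, so every token mixes in features of its immediate
    neighbours before the attention step.
    """
    B, N, C = x.shape
    H = W = int(N ** 0.5)                      # assumes a square patch grid
    x = x.transpose(1, 2).reshape(B, C, H, W)  # to (B, C, H, W)
    out = torch.zeros_like(x)
    q = C // 4
    out[:, 0*q:1*q, :, shift_pixel:] = x[:, 0*q:1*q, :, :-shift_pixel]  # take left neighbour
    out[:, 1*q:2*q, :, :-shift_pixel] = x[:, 1*q:2*q, :, shift_pixel:]  # take right neighbour
    out[:, 2*q:3*q, shift_pixel:, :] = x[:, 2*q:3*q, :-shift_pixel, :]  # take upper neighbour
    out[:, 3*q:4*q, :-shift_pixel, :] = x[:, 3*q:4*q, shift_pixel:, :]  # take lower neighbour
    return out.reshape(B, C, N).transpose(1, 2)
```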
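The bidirectional global attention can be illustrated with a naive reference computation. The sketch below assumes per-channel decay `w` and bonus `u` vectors as in RWKV and evaluates the bidirectional weighted-key-value (WKV) aggregation in O(T²) for clarity; the paper's kernel computes the same quantity with forward and backward scans in linear time.

```python
import torch

def bidirectional_wkv(k, v, w, u):
    """Naive reference for a bidirectional RWKV-style WKV aggregation.

    k, v: (T, C) keys and values for T tokens.
    w, u: (C,) learned per-channel decay and current-token bonus.
    Every token attends to all others with a decay based on their distance,
    while its own key receives the bonus u.
    """
    T, C = k.shape
    out = torch.empty_like(v)
    idx = torch.arange(T)
    for t in range(T):
        # distance-based decay in both directions, relative to token t
        decay = -(torch.abs(idx - t).float().unsqueeze(1) - 1) / T * w  # (T, C)
        weights = torch.exp(decay + k)                                  # (T, C)
        weights[t] = torch.exp(u + k[t])          # current token uses the bonus u
        out[t] = (weights * v).sum(dim=0) / weights.sum(dim=0)
    return out
```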