Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

7 Mar 2024 | Yuchen Duan*, Weiyun Wang*, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, and Wenhai Wang
Vision-RWKV (VRWKV) adapts the RWKV architecture from natural language processing to visual tasks, with modifications that let it handle high-resolution images efficiently. Unlike traditional Vision Transformers (ViTs), whose attention has quadratic computational complexity, Vision-RWKV uses a linear-complexity attention mechanism, so it can process high-resolution images without windowing operations, yielding faster inference and lower memory usage.

The model incorporates a bidirectional global attention mechanism and a quad-directional token shift (Q-Shift) to capture global information and to remain stable during scaling. Experiments show that Vision-RWKV matches or outperforms ViT on image classification and on dense prediction tasks such as object detection, instance segmentation, and semantic segmentation, at lower computational cost. The model is scalable, with variants ranging from VRWKV-Tiny (6M parameters) to VRWKV-Large (335M parameters), and is trained on large-scale datasets such as ImageNet-1K and ImageNet-22K. Its linear-complexity attention makes it a promising alternative to ViT for visual perception tasks.
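To make the two mechanisms concrete, below is a minimal PyTorch sketch of a VRWKV-style spatial-mix step, written from the paper's description rather than the released code. The function names `q_shift` and `bi_wkv`, the single scalar `mix` ratio, and the zero-initialized decay vectors `w`/`u` are illustrative assumptions, and the bidirectional attention is evaluated as an explicit O(N²) weight matrix for readability instead of the paper's linear-time recurrent form.

```python
import torch
import torch.nn as nn


def q_shift(x: torch.Tensor, h: int, w: int, mix: float = 0.5) -> torch.Tensor:
    """Quad-directional token shift: split channels into four groups, shift each
    group one pixel toward one of the four spatial neighbors, then blend with
    the original tokens. (The paper learns separate mix ratios per projection;
    a single scalar is used here for brevity.)"""
    B, N, C = x.shape
    assert N == h * w, "token count must match the patch grid"
    img = x.transpose(1, 2).reshape(B, C, h, w)
    shifted = torch.zeros_like(img)
    g = C // 4
    shifted[:, 0 * g:1 * g, :, 1:] = img[:, 0 * g:1 * g, :, :-1]  # from left
    shifted[:, 1 * g:2 * g, :, :-1] = img[:, 1 * g:2 * g, :, 1:]  # from right
    shifted[:, 2 * g:3 * g, 1:, :] = img[:, 2 * g:3 * g, :-1, :]  # from above
    shifted[:, 3 * g:, :-1, :] = img[:, 3 * g:, 1:, :]            # from below
    shifted = shifted.reshape(B, C, N).transpose(1, 2)
    return x + mix * (shifted - x)


def bi_wkv(k: torch.Tensor, v: torch.Tensor,
           w: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Bidirectional WKV: every token attends to all others with a per-channel
    exponential decay in token distance, plus a learned 'bonus' u for itself.
    Written as an explicit (N, N) weight matrix for clarity; the same quantity
    can be computed recurrently in linear time."""
    B, N, C = k.shape
    pos = torch.arange(N, device=k.device)
    dist = (pos[:, None] - pos[None, :]).abs().float()   # (N, N) token distance
    logit = -(dist[:, :, None] - 1.0) / N * w            # (N, N, C) decay term
    logit = logit[None] + k[:, None, :, :]               # (B, N, N, C)
    diag = torch.eye(N, device=k.device, dtype=torch.bool)
    logit[:, diag] = u + k                               # i == t uses bonus u
    weights = logit.softmax(dim=2)                       # normalize over source i
    return (weights * v[:, None, :, :]).sum(dim=2)       # (B, N, C)


# Toy forward pass of one spatial-mix step: 14x14 patch grid, 64 channels.
x = torch.randn(2, 196, 64)
xs = q_shift(x, 14, 14)
to_r, to_k, to_v = (nn.Linear(64, 64) for _ in range(3))
w, u = torch.zeros(64), torch.zeros(64)      # learned per-channel in practice
out = torch.sigmoid(to_r(xs)) * bi_wkv(to_k(xs), to_v(xs), w, u)
print(out.shape)                             # torch.Size([2, 196, 64])
```

In the actual model the decay `w`, bonus `u`, and shift ratios are learned parameters, and the WKV is computed with a linear-complexity recurrence rather than the dense matrix above; that linear scaling in token count is what lets VRWKV handle high-resolution inputs more cheaply than quadratic ViT attention.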