Vision-LSTM: xLSTM as Generic Vision Backbone

2 Jul 2024 | Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter
Vision-LSTM (ViL) is a new vision backbone that adapts the xLSTM architecture to computer vision tasks. ViL stacks mLSTM blocks whose scan direction alternates: one block traverses the sequence of image patches from the top-left patch to the bottom-right, the next traverses it in reverse. This alternation lets an inherently sequential model handle non-sequential inputs such as images. Because the blocks are parallelizable and their computational and memory cost grows linearly with sequence length, ViL is well suited to high-resolution tasks such as medical imaging, segmentation, and physics simulations.

On the ImageNet-1K, ADE20K, and VTAB-1K benchmarks, ViL outperforms existing vision models such as Vision Transformers (ViT) and Vision Mamba (Vim) in classification and segmentation, and it also performs strongly in transfer learning. The alternating-direction design processes images without extra computation beyond reversing the token order, making ViL more parameter- and compute-efficient than competing models. The architecture is also robust to different classification-head designs and can be adapted to a range of vision tasks. Future work includes exploring hierarchical architectures and improved pre-training schemes to further enhance ViL's performance.
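To make the alternating-direction idea concrete, the sketch below wires a patch embedding to a stack of blocks that flip the token sequence on every second block. This is a minimal illustration, not the authors' reference implementation: the mLSTM block is replaced by a plain residual MLP stand-in (the real block is defined in the xLSTM work), and the names (ViLSketch, BlockStandIn), the hyperparameters, and the pooled-token classifier head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BlockStandIn(nn.Module):
    """Stand-in for an mLSTM block (the real block comes from the xLSTM work);
    a plain pre-norm residual MLP is used here only to show the wiring."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, num_patches, dim)
        return x + self.mlp(self.norm(x))

class ViLSketch(nn.Module):
    """Patchify and embed an image, then apply blocks with alternating scan direction."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=12, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(BlockStandIn(dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # row-major patch sequence: top-left patch first, bottom-right patch last
        x = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        for i, block in enumerate(self.blocks):
            if i % 2 == 0:
                x = block(x)                  # forward scan: top-left -> bottom-right
            else:
                x = block(x.flip(1)).flip(1)  # reversed scan, then restore patch order
        x = self.norm(x)
        # illustrative head: classify from the average of the first and last patch token
        return self.head((x[:, 0] + x[:, -1]) / 2)

model = ViLSketch()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1000])
```

Note that reversing the sequence is just a memory permutation, which is why the alternating design adds no extra floating-point operations over a single-direction stack.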