2 Jul 2024 | Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter
The paper introduces Vision-LSTM (ViL), a novel architecture that adapts the xLSTM building blocks to computer vision tasks. ViL is designed to process sequences of patch tokens from images, where odd blocks process the sequence row-wise and even blocks process it column-wise. This alternating design allows ViL to efficiently handle non-sequential inputs like images without introducing additional computational overhead. ViL is compared against other isotropic architectures and shows strong performance in classification, semantic segmentation, and transfer learning tasks on datasets such as ImageNet-1K, ADE20K, and VTAB-1K. The experiments demonstrate that ViL outperforms heavily optimized Vision Transformers (ViTs) and other vision adaptation models, particularly in tasks requiring high-resolution images, such as medical imaging and segmentation. The paper also explores different block designs and classification strategies, highlighting the effectiveness of alternating blocks and the "bilateral concat" pooling method. Future work includes improving pre-training schemes, exploring better hyperparameter settings, and investigating hierarchical architectures.
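The alternating traversal and the "bilateral concat" pooling described above can be sketched schematically. This is a minimal NumPy illustration, not the authors' implementation: the `vil_forward`, `bilateral_concat`, and `grid` names are hypothetical, and each xLSTM block is stood in for by an arbitrary callable on the token sequence.

```python
import numpy as np

def vil_forward(tokens, blocks, grid):
    """Sketch of ViL's alternating traversal (hypothetical helper).

    tokens: (h*w, dim) patch embeddings in row-major (row-wise) order.
    blocks: list of callables standing in for xLSTM blocks.
    grid:   (h, w) patch-grid shape.
    """
    h, w = grid
    x = tokens
    for i, block in enumerate(blocks):
        if i % 2 == 0:
            # 1st, 3rd, ... blocks: process the sequence in row-wise order.
            x = block(x)
        else:
            # 2nd, 4th, ... blocks: reorder tokens column-wise, process,
            # then restore row-major order so the next block sees it again.
            d = x.shape[-1]
            col = x.reshape(h, w, d).transpose(1, 0, 2).reshape(h * w, d)
            col = block(col)
            x = col.reshape(w, h, d).transpose(1, 0, 2).reshape(h * w, d)
    return x

def bilateral_concat(tokens):
    # "bilateral concat" pooling: concatenate the features of the first
    # and last tokens of the sequence for the classification head.
    return np.concatenate([tokens[0], tokens[-1]])
```

The key design point is that only the token *ordering* alternates between blocks; the per-block computation is unchanged, which is why the scheme adds no extra parameters or FLOPs.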