25 Apr 2024 | Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
This paper investigates the in-context learning (ICL) capabilities of state-space models (SSMs), particularly Mamba, and compares them with Transformers. The study evaluates a range of ICL tasks, including regression, decision trees, sparse parity, and retrieval. The results show that Mamba matches Transformers on standard regression tasks and outperforms them on tasks such as sparse parity learning, but struggles on tasks involving non-standard retrieval functionality. To address these limitations, the authors introduce MambaFormer, a hybrid model that interleaves Mamba blocks with attention blocks. MambaFormer performs well on the tasks where each individual architecture struggles, such as sparse parity and retrieval. The authors also evaluate hybrid models on synthetic formal-language ICL datasets, where they perform as well as or better than both Transformer and Mamba. Overall, the findings suggest that hybrid architectures are a promising avenue for improving ICL in language models.
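To make the hybrid structure concrete, below is a minimal PyTorch sketch of a MambaFormer-style stack: an SSM block first (in place of positional encodings), followed by alternating attention and SSM blocks. This is an illustrative sketch, not the paper's implementation: `SimpleSSMBlock` is a hypothetical gated linear-recurrence stand-in for a real Mamba (selective SSM) block, and the class name, depth, and dimensions are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class SimpleSSMBlock(nn.Module):
    """Stand-in for a Mamba block: a gated diagonal linear recurrence.
    (A faithful MambaFormer would use the selective-SSM block from the mamba_ssm package.)"""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.rand(d_model))   # per-channel recurrence decay
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                    # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):                       # sequential scan over the sequence
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * torch.sigmoid(gate)
        return x + self.out_proj(y)                      # residual connection

class AttentionBlock(nn.Module):
    """Pre-norm causal self-attention block, with no positional encoding."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        y, _ = self.attn(h, h, h, attn_mask=mask)
        return x + y

class MambaFormerSketch(nn.Module):
    """Hybrid stack in the spirit of MambaFormer: an SSM block first,
    then alternating attention and SSM blocks."""
    def __init__(self, d_model=64, depth=2):
        super().__init__()
        layers = [SimpleSSMBlock(d_model)]
        for _ in range(depth):
            layers += [AttentionBlock(d_model), SimpleSSMBlock(d_model)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):                                # x: (batch, seq, d_model)
        return self.layers(x)

# Toy usage: a batch of 8 in-context sequences of 32 embedded (x, y) tokens.
model = MambaFormerSketch(d_model=64, depth=2)
tokens = torch.randn(8, 32, 64)
print(model(tokens).shape)                               # torch.Size([8, 32, 64])
```

The design point this sketch tries to convey is the division of labor in the hybrid: the leading SSM block injects sequence-order information (which, per the paper, lets the model drop explicit positional encodings), while the interleaved attention blocks supply the retrieval ability that Mamba alone lacks.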