13 Feb 2024 | Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
This paper addresses automatic speech recognition (ASR) using large language models (LLMs) and speech encoders. The authors challenge the notion that complex designs are necessary for effective ASR systems and propose an embarrassingly simple approach. They benchmark various combinations of LLMs and speech encoders, which leads to SLAM-ASR, a system in which the only trainable component is a linear projector that aligns the speech encoder with the LLM. SLAM-ASR achieves state-of-the-art performance on the LibriSpeech benchmark, outperforming both LLM-based and NN-based ASR models. The study also examines how capability emerges during the training of LLM-based ASR systems, providing insights into the effectiveness of freezing the speech encoder and the importance of prompt engineering. The research highlights the potential of LLM-based ASR and offers a clean and efficient framework for future work in the field.
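The core idea summarized above, a trainable linear projector sitting between a speech encoder and an LLM, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the toy dimensions, the function names (`make_linear`, `project`, `build_llm_input`), the use of a single linear layer, and the choice to place the speech embeddings before the prompt embeddings are all assumptions made for illustration. Random vectors stand in for the encoder output and the prompt token embeddings.

```python
import random

def make_linear(d_in, d_out, rng):
    """Create weights and bias for one linear layer (the only trainable part in this sketch)."""
    scale = 1.0 / d_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias

def project(frames, weight, bias):
    """Map each encoder frame (length d_in) to an LLM-sized vector (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]

def build_llm_input(speech_embeds, prompt_embeds):
    """Concatenate along the time axis; the speech-then-prompt order is an assumption."""
    return speech_embeds + prompt_embeds

rng = random.Random(0)
D_ENC, D_LLM = 8, 16          # hypothetical toy sizes, not the paper's dimensions
N_FRAMES, N_PROMPT = 5, 3     # hypothetical sequence lengths

# Stand-ins for the speech encoder output and the prompt token embeddings.
encoder_out = [[rng.gauss(0, 1) for _ in range(D_ENC)] for _ in range(N_FRAMES)]
prompt_embeds = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(N_PROMPT)]

weight, bias = make_linear(D_ENC, D_LLM, rng)
speech_embeds = project(encoder_out, weight, bias)
llm_input = build_llm_input(speech_embeds, prompt_embeds)

# Only the projector's weights and bias would receive gradient updates.
trainable_params = D_ENC * D_LLM + D_LLM
print(len(llm_input), len(llm_input[0]), trainable_params)
```

In a real system the projection would typically be a tensor-library layer trained by backpropagation while the surrounding models stay fixed. The sketch only shows the shape bookkeeping: every speech frame ends up in the same embedding space as the prompt tokens, so the LLM can consume both as one sequence.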