13 Feb 2024 | Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
This paper presents an embarrassingly simple approach for equipping large language models (LLMs) with strong automatic speech recognition (ASR) capability. The authors propose SLAM-ASR, a system that combines a pre-trained speech encoder, a pre-trained LLM, and a single trainable linear projector to perform ASR. The system achieves state-of-the-art performance on the LibriSpeech benchmark, outperforming other LLM-based ASR models and even surpassing recent audio-universal LLMs trained on massive amounts of paired audio-text data. The key insight is that a powerful speech encoder, a suitable LLM, and a single trainable linear projector are sufficient for ASR, without the need for complex task-specific designs.
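As a rough sketch of this composition (not the authors' exact code; module names, dimensions, and the HuggingFace-style `inputs_embeds` calling convention are assumptions), the model can be assembled from a frozen speech encoder, a frozen decoder-only LLM, and one trainable linear layer:

```python
import torch
import torch.nn as nn

class SlamAsrSketch(nn.Module):
    """Frozen speech encoder + trainable linear projector + frozen LLM."""

    def __init__(self, speech_encoder, llm, encoder_dim, llm_dim):
        super().__init__()
        self.speech_encoder = speech_encoder              # pre-trained speech encoder (frozen)
        self.llm = llm                                    # pre-trained decoder-only LLM (frozen)
        self.projector = nn.Linear(encoder_dim, llm_dim)  # the only trainable module

        # Freeze everything except the projector.
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, speech, prompt_embeds):
        # Encode speech, project it into the LLM embedding space, and
        # prepend it to the text prompt embeddings before decoding.
        feats = self.speech_encoder(speech)          # (batch, frames, encoder_dim)
        speech_embeds = self.projector(feats)        # (batch, frames, llm_dim)
        inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)        # HuggingFace-style call, assumed
```

During training only `self.projector` receives gradients, which is what keeps the recipe so lightweight.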
The paper explores various combinations of LLMs and speech encoders to find the best-performing LLM-based ASR system. Benchmarking different models, it finds that fine-tuned chat models perform better than raw pre-trained LLMs for ASR. The study also examines how ASR capability emerges during training, showing that performance improves sharply once the model learns to align the speech and text modalities.
The proposed SLAM-ASR system is implemented with a clean setup and minimal task-specific design: a speech encoder, a projector, and an LLM. The output of the speech encoder is downsampled to shorten the speech feature sequence, and the projector maps the downsampled features into the LLM's input embedding space. Only the linear projector is trained, with the speech encoder and LLM kept frozen, yet the system achieves state-of-the-art performance on the LibriSpeech benchmark.
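A minimal sketch of the downsampling-plus-projection step, assuming the common recipe of concatenating every k consecutive encoder frames before a single linear layer (the factor k=5 and the dimensions in the example are illustrative assumptions rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn

class FrameDownsampleProjector(nn.Module):
    """Concatenate every k consecutive frames, then map to the LLM embedding dimension."""

    def __init__(self, encoder_dim: int, llm_dim: int, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(encoder_dim * k, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, encoder_dim)
        b, t, d = feats.shape
        t = t - (t % self.k)                    # drop trailing frames that do not fill a group
        stacked = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(stacked)               # (batch, frames // k, llm_dim)

# Example: 100 encoder frames of width 1280 become 20 LLM-ready embeddings of width 4096.
projector = FrameDownsampleProjector(encoder_dim=1280, llm_dim=4096, k=5)
llm_inputs = projector(torch.randn(1, 100, 1280))   # shape: (1, 20, 4096)
```

Reducing the frame rate this way keeps the speech portion of the LLM's input sequence short, which matters for both memory use and decoding speed.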
The paper also compares SLAM-ASR with previous NN-based ASR models and finds that it outperforms them. The study highlights the potential of LLM-based ASR systems and suggests that they can be extended with cross-modal capabilities. The results show that SLAM-ASR is a promising approach for ASR, with the potential to be applied in a variety of real-world scenarios.