WavLLM: Towards Robust and Adaptive Speech Large Language Model
**Authors:** Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sumit Sivasankaran, Linquan Liu, Furu Wei
**Institutions:** The Chinese University of Hong Kong; Microsoft Corporation
**Abstract:**
This paper introduces WavLLM, a robust and adaptive speech large language model (LLM) designed to integrate listening capabilities into LLMs. WavLLM employs dual encoders—Whisper and WavLM—to process semantic and acoustic information, respectively. It is optimized using a two-stage curriculum learning approach, starting with basic single-task training and advancing to multi-task training. A prompt-aware LoRA weight adapter enhances the model's flexibility and adaptability to different tasks and instructions. Experiments on universal speech benchmarks and specialized datasets, such as the Gaokao English listening comprehension test, demonstrate WavLLM's state-of-the-art performance and robust generalization capabilities, particularly in complex tasks using Chain-of-Thought (CoT) approaches.
**Contributions:**
1. **Curriculum Learning:** A two-stage curriculum learning approach that progressively fine-tunes LLMs to follow instructions, starting from simple tasks and advancing to complex ones.
2. **Dual Encoders:** Utilizes Whisper for semantic content and WavLM for acoustic information, enhancing speech representation and downstream task performance.
3. **Prompt-Aware LoRA:** Introduces a prompt-aware LoRA weight adapter that dynamically adjusts LoRA weights based on the given prompt, improving generalization across tasks and instructions (a sketch follows this list).
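To make the prompt-aware adapter concrete, here is a minimal PyTorch sketch of a single LoRA linear layer whose low-rank update is gated by a pooled prompt embedding. The class name `PromptAwareLoRALinear`, the scalar-gate design, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptAwareLoRALinear(nn.Module):
    """Hypothetical sketch: a frozen linear layer plus a LoRA update whose
    strength is predicted from the prompt embedding (assumed design)."""

    def __init__(self, in_dim, out_dim, rank=8, prompt_dim=4096):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)  # frozen pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, out_dim, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # standard LoRA zero-init
        # Adapter mapping the prompt embedding to a gate in (0, 1).
        self.gate = nn.Sequential(nn.Linear(prompt_dim, 1), nn.Sigmoid())

    def forward(self, x, prompt_emb):
        # x: (batch, seq, in_dim); prompt_emb: (batch, prompt_dim),
        # e.g. a mean-pooled embedding of the instruction tokens.
        g = self.gate(prompt_emb).unsqueeze(1)  # (batch, 1, 1), broadcasts
        return self.base(x) + g * self.lora_b(self.lora_a(x))
```

The point of the gate is that different instructions can modulate how strongly the adapted weights are applied, rather than using one fixed LoRA update for every task.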
**Methods:**
- **Model Architecture:** Combines Whisper and WavLM encoders with an LLM backbone (LLaMA-2-7B-chat); a rough sketch follows this list.
- **Training Stages:** Two-stage curriculum learning, with mixed single-task training first and advanced multi-task training second.
- **Evaluation:** Conducts single-task and multi-task evaluations covering automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), emotion recognition (ER), speech question answering (SQA), and CoT tasks.
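As a rough illustration of the architecture bullet above, the following PyTorch sketch fuses the two encoder streams and projects them into the LLM embedding space. The concatenate-then-project fusion, the MLP adapter, and the feature dimensions are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualEncoderFrontEnd(nn.Module):
    """Minimal sketch of the dual-encoder front end: Whisper features carry
    semantic content, WavLM features carry acoustic/speaker cues."""

    def __init__(self, whisper_dim=1280, wavlm_dim=1024, llm_dim=4096):
        super().__init__()
        # Adapter projecting the fused speech features into the LLM space.
        self.adapter = nn.Sequential(
            nn.Linear(whisper_dim + wavlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, whisper_feats, wavlm_feats):
        # Both streams assumed time-aligned: (batch, T, dim).
        fused = torch.cat([whisper_feats, wavlm_feats], dim=-1)
        # Output (batch, T, llm_dim) serves as speech "tokens" prepended to
        # the text prompt embeddings before the LLaMA-2-7B-chat decoder.
        return self.adapter(fused)
```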
**Results:**
- **Single-Task Evaluation:** Achieves state-of-the-art performance on various speech-related tasks.
- **Multi-Task Evaluation:** Demonstrates superior performance on zero-shot independent and CoT tasks compared with other open-source models; an illustrative CoT decomposition follows this list.
- **Robustness Analysis:** Shows strong robustness to unseen and diverse prompts, outperforming baseline models.
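To show what a CoT task looks like at inference time, the sketch below decomposes a compound instruction into sequential sub-steps, each conditioned on the previous output. The helper `wavllm_generate` and the specific task chain (ASR, then translation, then summarization) are hypothetical, chosen only to illustrate the pattern.

```python
# Hypothetical CoT decomposition of a compound speech instruction.
# `wavllm_generate(audio, prompt)` is an assumed inference helper, not the
# paper's API; each step feeds its output into the next step's prompt.

def cot_pipeline(audio, wavllm_generate):
    # Step 1: transcribe the speech (ASR).
    transcript = wavllm_generate(audio, "Transcribe the audio.")
    # Step 2: translate the transcript (ST), conditioned on step 1.
    translation = wavllm_generate(
        audio, f"Translate this transcript into German: {transcript}"
    )
    # Step 3: summarize the translation, conditioned on step 2.
    return wavllm_generate(audio, f"Summarize in one sentence: {translation}")
```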
**Conclusion:**
WavLLM is a robust and adaptive speech LLM that excels in various speech tasks and generalizes well to complex instructions. Future work will focus on enhancing CoT processing and adding speech synthesis capabilities.