WavLLM: Towards Robust and Adaptive Speech Large Language Model

2024 | Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
WavLLM is a robust and adaptive speech large language model that combines dual speech encoders with a prompt-aware LoRA weight adapter, optimized through two-stage curriculum learning. A Whisper encoder extracts the semantic content of the input speech, while a WavLM encoder captures acoustic characteristics such as speaker identity. Training begins with a mixture of elementary single tasks, including automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), emotion recognition (ER), instruction tuning (IT), and spoken question answering (SQA), and then advances to more complex multi-task training. In the second stage, the prompt-aware LoRA weight adapter adjusts the LoRA weights according to the given instruction, improving the model's flexibility and its adherence to diverse tasks.

Validated on universal speech benchmarks and specialized datasets, WavLLM achieves state-of-the-art performance across these speech tasks and completes Gaokao English listening comprehension tests without task-specific training. It generalizes well to both seen and unseen prompts and performs strongly on zero-shot SQA and chain-of-thought (CoT) based tasks, surpassing other open-source speech LLMs on multiple benchmarks. The design is further analyzed through visualization of the learned LoRA weights and per-task performance breakdowns, showing that the architecture and curriculum together make WavLLM an effective instruction-following model for speech.
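To make the prompt-aware adaptation concrete, here is a minimal sketch of the idea behind a prompt-aware LoRA weight adapter: a small network reads a pooled prompt embedding and produces a gate that scales the low-rank LoRA update added to a frozen base projection. This is a simplified illustration, not the paper's exact implementation; all dimensions, the scalar gate, and the names (`w_gate`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 4

# Frozen base weight of a linear layer (shapes are illustrative).
W = rng.standard_normal((d_model, d_model)) * 0.02

# Low-rank LoRA factors: down-projection A, up-projection B.
# B is zero-initialized, so the LoRA update starts at zero.
A = rng.standard_normal((rank, d_model)) * 0.02
B = np.zeros((d_model, rank))

# Hypothetical prompt-aware adapter: maps a pooled prompt embedding
# to a scalar gate in (0, 1) that scales the LoRA update.
w_gate = rng.standard_normal(d_model) * 0.02

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, prompt_embedding):
    """x: (batch, d_model) hidden states; prompt_embedding: (d_model,)."""
    gate = sigmoid(w_gate @ prompt_embedding)  # prompt-dependent scalar
    delta = (x @ A.T) @ B.T                    # low-rank LoRA update
    return x @ W.T + gate * delta

x = rng.standard_normal((2, d_model))
prompt = rng.standard_normal(d_model)
y = forward(x, prompt)
print(y.shape)  # (2, 16)
```

In WavLLM the adapter conditions the LoRA contribution on the instruction itself, so different prompts effectively activate different adapted weights; the scalar gate above is the simplest form of that conditioning.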