31 May 2024 | Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff
**SpeechVerse: A Large-scale Generalizable Audio Language Model**
**Abstract:**
Large language models (LLMs) have demonstrated remarkable proficiency in tasks requiring semantic understanding of natural language instructions. However, models that extend this capability to audio inputs are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. To address this, we develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters while keeping the pre-trained models frozen during training. The models are instruction fine-tuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. Extensive benchmarking shows that SpeechVerse outperforms conventional task-specific baselines on 9 out of 11 tasks, demonstrating its superior instruction-following capability.
**Introduction:**
LLMs have shown significant progress on natural language tasks through self-supervised pre-training on large text corpora. However, their ability to perceive non-textual modalities like speech remains limited. SpeechVerse aims to empower LLMs to deeply understand speech, enhancing human-computer interaction and multimodal dialog agents. Unlike existing approaches that first transcribe speech and then process the resulting text, SpeechVerse directly fuses textual LLMs with speech encoders within an end-to-end training framework. This enables richer speech and audio comprehension than cascaded pipelines that operate only on the transcribed text.
**Approach:**
SpeechVerse's architecture consists of three main components: a pre-trained audio encoder, a 1-D convolution module, and a pre-trained LLM. The audio encoder extracts semantic features from the audio signal, the convolution module downsamples these features so their sequence length is closer to that of the text tokens, and the LLM consumes the downsampled features together with the textual instruction to perform the requested task. The model is trained with supervised instruction fine-tuning on continuous representations from the speech foundation model, updating only a small set of parameters while the pre-trained encoder and LLM remain frozen.
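To make this concrete, here is a minimal PyTorch-style sketch of how a frozen audio encoder, a trainable 1-D convolutional downsampler, and a frozen LLM could be wired together. The class name, dimensions, downsampling factor, and the keyword-argument interface of the LLM are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the SpeechVerse-style architecture described above.
# All names, dimensions, and the downsampling factor are illustrative
# assumptions; the paper's exact encoder, LLM, and hyperparameters may differ.
import torch
import torch.nn as nn


class AudioConditionedLM(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        self.audio_encoder = audio_encoder  # pre-trained speech foundation model (frozen)
        self.llm = llm                      # pre-trained text LLM (frozen)
        for p in self.audio_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # Only this 1-D convolutional adapter is trained: it downsamples the
        # audio feature sequence and projects it into the LLM embedding space.
        self.downsampler = nn.Sequential(
            nn.Conv1d(audio_dim, llm_dim, kernel_size=stride, stride=stride),
            nn.GELU(),
        )

    def forward(self, audio: torch.Tensor, instruction_embeds: torch.Tensor):
        # audio -> continuous latent features of shape (batch, frames, audio_dim)
        feats = self.audio_encoder(audio)
        # Conv1d expects (batch, channels, frames), so transpose around it.
        audio_embeds = self.downsampler(feats.transpose(1, 2)).transpose(1, 2)
        # Prepend the shortened audio sequence to the instruction token embeddings
        # and let the frozen LLM attend over both to produce the task output.
        # Assumes a HuggingFace-style `inputs_embeds` interface on the LLM.
        inputs_embeds = torch.cat([audio_embeds, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Because only the convolutional adapter carries gradients, the trainable parameter count stays small relative to the frozen encoder and LLM, which is the parameter-efficiency property highlighted in the abstract.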
**Experiments:**
SpeechVerse models are evaluated on 11 diverse tasks across multiple domains and datasets, including automatic speech recognition (ASR), spoken language understanding (SLU), and paralinguistic speech processing (PSP). The results show that SpeechVerse outperforms conventional task-specific models on most tasks, demonstrating strong zero-shot generalization and robust performance on out-of-domain datasets, unseen prompts, and unseen tasks.
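The zero-shot evaluation hinges on pairing the same audio with different natural language instructions. The snippet below illustrates that idea with hypothetical prompts for a few of the evaluated task families; the actual instructions, placeholder convention, and datasets used in the paper may differ.

```python
# Illustrative only: hypothetical natural-language instructions for a few of the
# evaluated task families (ASR, SLU, paralinguistics). The exact prompts,
# datasets, and metrics used in the paper may differ.
EXAMPLE_INSTRUCTIONS = {
    "asr": "Transcribe the audio verbatim.",
    "slu_intent": "What is the speaker's intent in this recording?",
    "emotion": "Identify the emotion expressed by the speaker.",
}


def build_prompt(task: str) -> str:
    """Pair a task instruction with an audio placeholder.

    The <audio> placeholder marks where the downsampled audio embeddings are
    spliced into the LLM input; the tag itself is an assumed convention here.
    """
    return f"<audio>\n{EXAMPLE_INSTRUCTIONS[task]}"


if __name__ == "__main__":
    for task in EXAMPLE_INSTRUCTIONS:
        print(task, "->", build_prompt(task))
```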
**Conclusion:**
SpeechVerse is a versatile framework that enables LLMs to follow natural language instructions for diverse speech processing tasks. Its superior performance and generalization capabilities highlight the efficacy of the proposed training methodology. Future work aims to expand SpeechVerse's capabilities to handle even more complex instructions and generalize to new domains.