SpeechVerse: A Large-scale Generalizable Audio Language Model

31 May 2024 | Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff
SpeechVerse is a large-scale, generalizable audio-language model that enables large language models (LLMs) to follow natural language instructions for diverse speech processing tasks. It is trained with a multi-task framework that combines pre-trained speech and text foundation models through a small set of learnable parameters, keeping the pre-trained models frozen throughout training. Instruction fine-tuning operates on continuous latent representations extracted from the speech foundation model, yielding strong zero-shot performance on a wide range of speech processing tasks specified purely through natural language instructions.

The model's architecture consists of three main components: a pre-trained audio encoder, a 1-D convolution module, and a pre-trained LLM. The audio encoder extracts semantic features from the audio, the convolution module downsamples these features, and the LLM combines them with the textual instruction to perform the requested task. Training follows a unified curriculum that couples multi-task learning with supervised instruction fine-tuning and requires no task-specific tagging, enabling generalization to unseen tasks from natural language instructions alone. Curriculum learning and parameter-efficient fine-tuning further improve performance and allow the model to scale to a large number of diverse datasets and tasks with limited compute resources.

SpeechVerse is evaluated on a variety of tasks, including automatic speech recognition (ASR), spoken language understanding (SLU), and paralinguistic speech processing (PSP), and outperforms conventional task-specific baselines on 9 out of 11 tasks. It is also tested on out-of-domain datasets, novel prompts, and unseen tasks, demonstrating its robustness and generalization capabilities.
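The three-component pipeline described above can be sketched in a few lines. This is a minimal, illustrative mock-up, not the paper's implementation: the encoder is a stand-in for a frozen speech foundation model, and all dimensions, kernel sizes, and strides are made-up toy values. It shows only the data flow that the text describes: audio features are extracted, shortened by a learnable 1-D convolution, and prefixed to the embedded text instruction before reaching the LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_audio(waveform_frames, d_enc=16):
    """Stand-in for the frozen pre-trained audio encoder: maps raw frames
    to a sequence of d_enc-dimensional semantic features (illustrative only)."""
    frames = np.asarray(waveform_frames, dtype=float)
    W = rng.normal(size=(1, d_enc))                  # fixed ("frozen") projection
    return frames.reshape(-1, 1) @ W                 # shape (T, d_enc)

def conv1d_downsample(features, weight, stride=4):
    """Learnable 1-D convolution that shortens the audio feature sequence
    before it reaches the LLM. features: (T, D_in); weight: (K, D_in, D_out)."""
    K, D_in, D_out = weight.shape
    T = features.shape[0]
    out_len = (T - K) // stride + 1
    out = np.empty((out_len, D_out))
    for i in range(out_len):
        window = features[i * stride : i * stride + K]   # (K, D_in)
        out[i] = np.einsum("kd,kde->e", window, weight)  # sum over taps and channels
    return out

def build_llm_input(audio_embeds, instruction_embeds):
    """Prefix the downsampled audio embeddings to the embedded text
    instruction, forming one input sequence for the (frozen) LLM."""
    return np.concatenate([audio_embeds, instruction_embeds], axis=0)

# Toy run: 100 audio frames -> 25 downsampled frames + 10 instruction tokens.
d_enc, d_llm, K, stride = 16, 8, 3, 4
feats = encode_audio(rng.normal(size=100), d_enc=d_enc)   # (100, 16)
w_conv = rng.normal(size=(K, d_enc, d_llm)) * 0.1         # the small learnable part
audio_embeds = conv1d_downsample(feats, w_conv, stride)   # (25, 8)
instr_embeds = rng.normal(size=(10, d_llm))               # embedded instruction tokens
llm_input = build_llm_input(audio_embeds, instr_embeds)   # (35, 8)
print(llm_input.shape)
```

The point of the strided convolution is purely practical: audio encoders emit many more frames per second than an LLM comfortably consumes as tokens, so the sequence is compressed before concatenation with the instruction.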
The results demonstrate that SpeechVerse is a robust and generalizable model that can perform a wide range of speech processing tasks using natural language instructions.