UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

14 Jun 2024 | Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng
UniAudio 1.5 introduces LLM-Codec, an audio codec that maps audio into the token space of a large language model (LLM), reducing the modality heterogeneity between text and audio. Because audio is expressed in the LLM's own vocabulary, a frozen LLM can learn new audio tasks from only a few demonstrations via cross-modal in-context learning, without any parameter updates.

LLM-Codec adopts a multi-scale residual vector quantization (RVQ) strategy to balance completeness and compactness: different quantization layers encode semantic, acoustic, and residual information respectively. The codec is further trained with semantic and consistency losses to improve performance.

Experiments show that LLM-Codec achieves strong reconstruction quality and supports both audio understanding and generation. Paired with a frozen LLM, it attains good performance on speech emotion classification, audio classification, simple text-to-speech generation, and speech denoising, validating that the proposed cross-modal in-context learning approach can solve a wide range of audio tasks with only a few examples. The model is open-sourced to facilitate research on few-shot audio task learning and multi-modal LLMs.
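To make the multi-scale RVQ idea concrete, here is a minimal PyTorch sketch, assuming hypothetical dimensions, downsampling factors, and a codebook sized to an LLM vocabulary; it illustrates the general technique, not the paper's implementation. Each layer quantizes the residual left by the previous layer, with coarser early layers capturing slowly varying (semantic) content and finer later layers capturing acoustic detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRVQ(nn.Module):
    """Sketch of multi-scale residual vector quantization.

    Each layer quantizes the residual left by the previous layer.
    Earlier layers run at a coarser time scale (the downsample
    factors here are illustrative assumptions), so they tend to
    capture semantic content; later layers capture acoustic detail.
    Straight-through gradient estimation is omitted for brevity.
    """

    def __init__(self, dim=256, codebook_size=32000, scales=(4, 2, 1)):
        super().__init__()
        # One codebook per layer; codebook_size matches an LLM
        # vocabulary (e.g., LLaMA-sized) so audio token ids can
        # live in the LLM's token space.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in scales
        )
        self.scales = scales

    def quantize_layer(self, x, codebook):
        # Nearest-neighbour lookup: (B, T, D) -> token ids (B, T).
        dist = torch.cdist(x, codebook.weight)  # (B, T, K)
        ids = dist.argmin(dim=-1)
        return codebook(ids), ids

    def forward(self, x):
        # x: (B, T, D) encoder features; T divisible by all scales.
        residual, all_ids, quantized = x, [], 0.0
        for scale, cb in zip(self.scales, self.codebooks):
            # Coarser layers quantize a time-downsampled residual.
            r = F.avg_pool1d(residual.transpose(1, 2), scale).transpose(1, 2)
            q, ids = self.quantize_layer(r, cb)
            # Upsample quantized codes back to full resolution.
            q = q.repeat_interleave(scale, dim=1)
            residual = residual - q
            quantized = quantized + q
            all_ids.append(ids)
        return quantized, all_ids

msrvq = MultiScaleRVQ()
feats = torch.randn(2, 16, 256)  # batch of encoder features
recon, ids = msrvq(feats)        # ids: one (B, T/scale) tensor per layer
```

Tying the codebook size (and, conceptually, the codebook entries) to the LLM vocabulary is what lets a frozen LLM consume the resulting audio tokens as if they were text.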
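Likewise, a hypothetical sketch of how a cross-modal few-shot prompt could be assembled once audio is tokenized into the LLM vocabulary; the delimiter strings, helper name, and token ids below are illustrative assumptions, not the paper's exact prompt format:

```python
def build_few_shot_prompt(task_instruction, demos, query_audio_tokens):
    """Assemble a few-shot prompt for a frozen LLM.

    demos: list of (audio_tokens, label) pairs, where audio_tokens
    are codec token ids already drawn from the LLM vocabulary.
    The <audio>...</audio> delimiters are illustrative placeholders.
    """
    parts = [task_instruction]
    for audio_tokens, label in demos:
        parts.append(f"<audio>{' '.join(map(str, audio_tokens))}</audio>")
        parts.append(f"Answer: {label}")
    parts.append(f"<audio>{' '.join(map(str, query_audio_tokens))}</audio>")
    parts.append("Answer:")
    return "\n".join(parts)

# Hypothetical usage for speech emotion classification:
prompt = build_few_shot_prompt(
    "Classify the emotion of the speech as happy, sad, or angry.",
    demos=[([1045, 2310, 3407], "happy"), ([2009, 2001, 6517], "sad")],
    query_audio_tokens=[1045, 2572, 4699],
)
```

A frozen LLM would then complete the final "Answer:" line, performing the task purely in-context.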