UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

14 Jun 2024 | Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng
The paper introduces UniAudio 1.5, an approach that leverages large language models (LLMs) to perform cross-modal in-context learning on audio tasks. Its key innovation is the LLM-Codec, a codec that maps audio into the LLM's textual token space, enabling a frozen LLM to understand and generate audio without any fine-tuning. The LLM-Codec uses a multi-scale residual vector quantization (RVQ) strategy to balance completeness (preserving audio information) against compactness (keeping token sequences short), encoding audio as tokens drawn from the LLM's vocabulary. Reducing this modality gap allows the LLM to learn new audio tasks from only a few demonstrations.

Experiments on a range of audio understanding and generation tasks, including speech emotion classification, audio classification, text-to-speech generation, and speech enhancement, demonstrate the effectiveness of UniAudio 1.5. Ablation studies and visualizations support these findings and highlight the importance of the semantic and consistency losses used to train the LLM-Codec. The LLM-Codec model is open-sourced to facilitate further research on few-shot audio task learning and multi-modal LLMs.
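To make the multi-scale RVQ idea concrete, below is a minimal PyTorch sketch of residual quantization across time scales. It is not the authors' implementation: the class name, stride schedule, and codebook size are illustrative assumptions, and in LLM-Codec the codebooks would additionally be tied to the LLM's token embeddings (so that quantized audio lands in the text token space) rather than learned freely as here.

```python
# Minimal sketch of multi-scale residual vector quantization (RVQ).
# Names and hyperparameters are hypothetical, chosen for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRVQ(nn.Module):
    def __init__(self, dim=512, codebook_size=1024, strides=(4, 2, 1)):
        super().__init__()
        # One codebook per scale: coarser scales (larger stride) capture
        # semantic content; finer scales capture acoustic residual detail.
        self.strides = strides
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in strides
        )

    def quantize(self, x, codebook):
        # Nearest-neighbor lookup: (B, T, D) features -> token indices.
        flat = x.reshape(-1, x.size(-1))                 # (B*T, D)
        dists = torch.cdist(flat, codebook.weight)       # (B*T, K)
        idx = dists.argmin(dim=-1).view(x.shape[:-1])    # (B, T)
        q = codebook(idx)                                # (B, T, D)
        # Straight-through estimator keeps the encoder trainable.
        q = x + (q - x).detach()
        return q, idx

    def forward(self, x):
        # x: (B, T, D) encoder features. Each stage quantizes the residual
        # left by the previous stage, at a progressively finer time scale.
        residual, quantized, indices = x, 0.0, []
        for stride, cb in zip(self.strides, self.codebooks):
            # Downsample the residual to this scale, quantize, upsample back.
            r = F.avg_pool1d(residual.transpose(1, 2), stride, stride)
            q, idx = self.quantize(r.transpose(1, 2), cb)
            q = F.interpolate(q.transpose(1, 2), size=x.size(1)).transpose(1, 2)
            quantized = quantized + q
            residual = residual - q
            indices.append(idx)
        return quantized, indices

# Usage: quantize a batch of encoder features into per-scale token ids.
rvq = MultiScaleRVQ()
feats = torch.randn(2, 64, 512)
recon, token_ids = rvq(feats)  # token_ids: one (B, T') index tensor per scale
```

The stacked residual stages are what give the codec its completeness/compactness trade-off: a few coarse tokens already sketch the audio, and finer stages refine it only where needed.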