Understanding Qwen2-Audio Technical Report

This report introduces Qwen2-Audio, a large-scale audio-language model that accepts audio and text inputs and generates text outputs. It strengthens instruction-following capabilities and supports two interaction modes: audio analysis and voice chat. In audio analysis mode, users provide audio together with text instructions for analysis; in voice chat mode, users interact with Qwen2-Audio by voice alone, without text input. The model is designed to make voice and text interaction seamless, so users do not need to distinguish between the two modes.

The architecture consists of an audio encoder, initialized from the Whisper-large-v3 model, and a large language model, trained on a large dataset. Pre-training uses natural language prompts, which simplifies the process and expands the data volume. The model also employs DPO (Direct Preference Optimization) to improve factuality and adherence to desired behavior.

The model is evaluated on benchmarks covering automatic speech recognition (ASR), speech-to-text translation (S2TT), speech emotion recognition (SER), vocal sound classification (VSC), and the AIR-Bench chat benchmark. Without task-specific fine-tuning, Qwen2-Audio achieves state-of-the-art performance on these tasks: it outperforms previous SOTAs in audio-centric instruction-following tasks and significantly outperforms other large audio-language models (LALMs). Examples in the report show its ability to handle complex audio scenarios in both audio understanding and dialogue. Qwen2-Audio is open-sourced to foster the development of the multi-modal language community.
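The summary names DPO without stating its objective. As a rough illustration, the sketch below computes the standard per-example DPO loss from the summed log-probabilities of a preferred and a rejected response under the trained policy and a frozen reference model. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions made for this example, not details taken from the report; in the audio setting the conditioning input would include the audio as well as any text.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss (illustrative sketch, hypothetical names).

    Each argument is the summed log-probability of a full response given the
    input, under either the policy being trained or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2 (about 0.693). The loss falls below that value as the policy shifts probability toward the preferred response and rises above it when the policy favors the rejected one.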

Qwen2-Audio Technical Report

15 Jul 2024 | Yunfei Chu*, Jin Xu*, Qian Yang*, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou†, Jingren Zhou
