Qwen2-Audio Technical Report

15 Jul 2024 | Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou
The Qwen2-Audio technical report introduces a large-scale audio-language model designed to process various audio signals and respond to speech instructions. Qwen2-Audio simplifies pre-training by using natural language prompts and expands the training dataset. It offers two interaction modes, Audio Analysis and Voice Chat, so users can interact with the model through voice or text. Qwen2-Audio demonstrates superior performance on multiple benchmarks, covering Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Recognition (SER), Vocal Sound Classification (VSC), and instruction-following tasks. Direct Preference Optimization (DPO) is applied to further align the model's responses with human preferences. The report includes detailed evaluations and case studies illustrating Qwen2-Audio's capabilities in audio analysis and interactive conversation. The model is open-source and aims to advance the multi-modal language community.