QWEN2 TECHNICAL REPORT

18 Jul 2024 | An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianwei Tu, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan
This technical report introduces the Qwen2 series, the latest addition to Alibaba's large language models and large multimodal models. The report presents a comprehensive suite of foundational and instruction-tuned language models, ranging from 0.5 to 72 billion parameters, including dense models and a Mixture-of-Experts (MoE) model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across various benchmarks in language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, demonstrates strong performance on multiple benchmarks, including 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, achieves 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Qwen2 also shows robust multilingual capabilities, supporting approximately 30 languages. To foster community innovation and accessibility, the Qwen2 model weights are publicly available on Hugging Face and ModelScope, along with supplementary materials on GitHub. These platforms provide resources for quantization, fine-tuning, and deployment, enabling a wide range of applications and research. The Qwen2 series includes models of different sizes, with the largest being Qwen2-72B and the smallest being Qwen2-0.5B. The models are pre-trained on a large-scale dataset of over 7 trillion tokens, covering a wide range of domains and languages. Post-training involves supervised fine-tuning and direct preference optimization (DPO), aligning the models with human preferences through learning from human feedback. Qwen2 has been evaluated on various benchmarks, demonstrating strong performance in language understanding, coding, mathematics, reasoning, and multilingual capabilities. 
The instruction-tuned variant, Qwen2-72B-Instruct, outperforms other models on multiple benchmarks, including MT-Bench, Arena-Hard, and LiveCodeBench. Qwen2 also shows strong performance in multilingual evaluations and safety assessments, outperforming proprietary models in some cases. The report concludes that Qwen2 is a versatile and powerful series of language models, with strong performance across a wide range of tasks and benchmarks. The models are designed to be accessible and usable for a wide range of applications and research.
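The post-training stage mentioned above uses direct preference optimization (DPO), which trains directly on preference pairs instead of fitting a separate reward model. As a rough illustration only (not the report's actual implementation, and with hypothetical log-probability values), the DPO loss for a single preference pair can be sketched in plain Python:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: how much the policy has moved
    # away from the reference model on that response.
    margin_chosen = policy_logp_chosen - ref_logp_chosen
    margin_rejected = policy_logp_rejected - ref_logp_rejected
    logits = beta * (margin_chosen - margin_rejected)
    # -log(sigmoid(logits)), written stably as softplus(-logits).
    return math.log1p(math.exp(-logits))

# A policy that raises the chosen response and lowers the rejected one
# (relative to the reference) gets a small loss; inverting the
# preference gets a large one. All values here are made up.
low = dpo_loss(-10.0, -30.0, -20.0, -25.0)
high = dpo_loss(-30.0, -10.0, -25.0, -20.0)
```

Minimizing this loss pushes the policy to assign relatively more probability mass to preferred responses, with `beta` controlling how far it may drift from the reference model.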