This technical report introduces the Qwen2 series, the latest addition to Alibaba's large language models and large multimodal models. The report presents a comprehensive suite of foundational and instruction-tuned language models, ranging from 0.5 to 72 billion parameters, including dense models and a Mixture-of-Experts (MoE) model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across various benchmarks in language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning.
The flagship model, Qwen2-72B, demonstrates strong performance on multiple benchmarks, including 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, achieves 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Qwen2 also shows robust multilingual capabilities, supporting approximately 30 languages.
To foster community innovation and accessibility, the Qwen2 model weights are publicly available on Hugging Face and ModelScope, along with supplementary materials on GitHub. These platforms provide resources for quantization, fine-tuning, and deployment, enabling a wide range of applications and research.
The Qwen2 series includes models of different sizes, with the largest being Qwen2-72B and the smallest being Qwen2-0.5B. The models are pre-trained on a large-scale dataset of over 7 trillion tokens, covering a wide range of domains and languages. Post-training involves supervised fine-tuning (SFT) and direct preference optimization (DPO), which align the models with human preferences by learning from human feedback.
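To make the DPO step concrete, here is a minimal sketch of the standard DPO objective (as introduced by Rafailov et al.) for a single preference pair. This is an illustrative toy implementation, not the report's actual training code; the function name, the toy log-probabilities, and the choice of beta are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the chosen or
    rejected response under the trainable policy or the frozen
    reference model. beta scales how strongly deviations from the
    reference are penalized.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference on each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): shrinks as the policy learns to prefer
    # the chosen response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already slightly favors the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
```

With a zero margin the loss is log 2; it decreases monotonically as the policy's preference for the chosen response grows, which is what drives the alignment update.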
Qwen2 has been evaluated on various benchmarks, demonstrating strong performance in language understanding, coding, mathematics, reasoning, and multilingual capabilities. The instruction-tuned variant, Qwen2-72B-Instruct, outperforms comparable open-weight models on multiple benchmarks, including MT-Bench, Arena-Hard, and LiveCodeBench. Qwen2 also performs strongly in multilingual evaluations and safety assessments, surpassing proprietary models in some cases.
The report concludes that Qwen2 is a versatile and powerful series of language models, with strong performance across a wide range of tasks and benchmarks. The models are designed to be accessible and adaptable for a wide range of applications and research.