6 Jun 2024 | Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman
KMMLU is a new Korean benchmark of 35,030 expert-level multiple-choice questions across 45 subjects. Unlike previous Korean benchmarks, which are largely translations of English datasets, KMMLU is sourced from original Korean exams and therefore captures authentic Korean linguistic and cultural aspects. The questions span the humanities, STEM, and applied science, with a focus on Korea-specific knowledge. The dataset is publicly available on the Hugging Face Hub, and the benchmark is integrated into EleutherAI's Language Model Evaluation Harness.
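For readers who want to work with the data directly, it can be pulled straight from the Hub. The sketch below assumes the repository id "HAERAE-HUB/KMMLU" and a subject config name such as "Accounting"; the exact identifiers, column names, and the corresponding Evaluation Harness task name should be checked against the Hub page and the harness version in use.

```python
# A minimal sketch of loading one KMMLU subject from the Hugging Face Hub.
# The repo id "HAERAE-HUB/KMMLU" and the config "Accounting" are assumptions;
# check the Hub page for the exact identifiers and column names.
from datasets import load_dataset

kmmlu = load_dataset("HAERAE-HUB/KMMLU", "Accounting")
example = kmmlu["test"][0]
print(example)  # typically the question text, four answer options, and the gold answer
```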
The authors evaluated 27 LLMs, spanning multilingual pretrained models, multilingual chat models, Korean pretrained models, Korean continually pretrained models, and proprietary models. The best public model scores only 50.5%, leaving substantial room for improvement. Korean-specific models such as POLYGLOT-KO perform poorly, and even the most capable proprietary models, GPT-4 and HYPERCLOVA X, score below 60%, suggesting that further work is needed to improve LLMs for Korean. The benchmark also includes a subset, KMMLU-HARD, built from questions that are particularly challenging for LLMs.
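The reported scores are accuracies on four-way multiple choice. A common way evaluation harnesses score such questions is to compare the log-likelihood the model assigns to each candidate answer letter and pick the highest. The sketch below illustrates that idea with an arbitrary small Korean model (EleutherAI/polyglot-ko-1.3b) and an illustrative prompt format; it is not the paper's exact evaluation configuration.

```python
# A hedged sketch of MMLU-style multiple-choice scoring: compare the model's
# log-likelihood of each answer letter given the question and pick the highest.
# The model choice and prompt format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/polyglot-ko-1.3b"  # any causal LM works; chosen for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_option(prompt: str, option_letter: str) -> float:
    """Log-likelihood of the answer letter, assuming it is a single token after the space."""
    ids = tok(prompt + " " + option_letter, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position -2 predict the final token (the answer letter)
    logprobs = torch.log_softmax(logits[0, -2], dim=-1)
    return logprobs[ids[0, -1]].item()

def predict(question: str, options: dict[str, str]) -> str:
    """Pick the option letter (e.g. 'A'-'D') with the highest log-likelihood."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    return max(options, key=lambda letter: score_option(prompt, letter))
```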
The authors also created chain-of-thought (CoT) exemplars to test models' reasoning on the benchmark. They found that while some models improve with CoT prompting, others' performance degrades. The analysis highlights the importance of Korea-specific knowledge in answering the questions and the difficulty non-Korean LLMs have producing accurate, reliable explanations in Korean.
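As an illustration of what CoT-style evaluation on KMMLU can look like, the sketch below builds a prompt from a single worked exemplar and extracts the final answer letter from the model's Korean explanation. The exemplar wording, the "정답:" (answer) marker, and the extraction regex are all assumptions for illustration, not the paper's actual exemplars or parsing rules.

```python
# A hedged sketch of CoT prompting for a KMMLU-style question: prepend one worked
# exemplar (a toy question: "At what temperature does water boil?"), ask for a
# step-by-step solution, and parse the final A-D letter from the generation.
import re

COT_EXEMPLAR = (
    "질문: 물은 몇 도에서 끓는가?\n"           # Question: at what temperature does water boil?
    "A. 50도\nB. 80도\nC. 100도\nD. 120도\n"
    "풀이: 표준 대기압에서 물의 끓는점은 100도이다. 따라서 정답은 C이다.\n"  # worked reasoning
    "정답: C\n\n"                              # final answer marker
)

def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    """Compose the exemplar, the target question, and a '풀이:' (solution) cue."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return f"{COT_EXEMPLAR}질문: {question}\n{opts}\n풀이:"

def extract_answer(generation: str) -> str | None:
    """Prefer an explicit '정답: X' marker; fall back to the last bare A-D letter."""
    marked = re.search(r"정답[은:]?\s*([ABCD])", generation)
    if marked:
        return marked.group(1)
    letters = re.findall(r"[ABCD]", generation)
    return letters[-1] if letters else None
```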
The study shows that while multilingual models like LLAMA-2, Yi, and Qwen outperform Korean-specific models like POLYGLOT-KO, there is still room for improvement. The results indicate that scaled decoder-only models can acquire capabilities in languages they are severely undertrained in, a finding that aligns with prior work. The study also highlights the importance of continual pretraining for improving model performance.
The authors conclude that KMMLU provides a comprehensive benchmark for evaluating LLMs in Korean and highlights the need for further research to improve the Korean proficiency of state-of-the-art models. The results also underscore the value of Korean pre-training for performance in Korea-specific contexts. The study acknowledges the benchmark's limitations, including coverage gaps and the potential for benchmark misuse, and the authors thank the contributors for their work.