6 Jun 2024 | Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman
The paper introduces KMMLU, a new Korean benchmark for measuring large language models (LLMs) on multitask language understanding. KMMLU consists of 35,030 expert-level multiple-choice questions across 45 subjects, sourced directly from Korean exams to capture the linguistic and cultural nuances of the Korean language. The benchmark addresses the limitations of existing Korean benchmarks, which are often translated from English and lack cultural relevance.
The authors evaluate 27 public and proprietary LLMs on KMMLU, finding that the best-performing model scores only 50.5%, indicating significant room for improvement. Notably, models trained primarily on English or Chinese data perform poorly on KMMLU, and even Korean-specific models such as POLYGLOT-Ko struggle. Advanced proprietary models like GPT-4 and HYPERCLOVA X do not exceed 60%, suggesting that further research is needed to improve LLMs for Korean.
The paper also includes a detailed analysis of how LLMs draw on Korean knowledge in question-answering, highlighting the importance of localized benchmarks. For instance, GPT-4 performs poorly in areas requiring localized knowledge, such as Korean history. HYPERCLOVA X, by contrast, improves consistently with Chain-of-Thought (CoT) prompting, whereas non-Korean LLMs struggle to produce accurate and reliable explanations in Korean.
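To make the CoT comparison concrete, the sketch below contrasts a direct-answer prompt with a CoT-style prompt for a generic multiple-choice item. The templates and helper functions are illustrative assumptions, not the exact prompts used in the paper.

```python
# Illustrative sketch of direct vs. chain-of-thought (CoT) prompting for a
# multiple-choice item. Question text and option labels are placeholders;
# the paper's actual prompt templates are not reproduced here.

def direct_prompt(question: str, options: list[str]) -> str:
    """Ask the model to answer immediately with a single option letter."""
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nAnswer with A, B, C, or D:"

def cot_prompt(question: str, options: list[str]) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{lettered}\n"
        "Let's think step by step, then state the final answer as a single letter."
    )
```

Under CoT prompting the model's accuracy depends on the quality of the intermediate Korean reasoning it generates, which is where the paper finds non-Korean models fall short.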
KMMLU is made publicly available on the Hugging Face Hub, and the evaluation code is integrated into EleutherAI's Language Model Evaluation Harness. The authors aim to use KMMLU to track progress in improving LLMs for Korean and to provide a comprehensive tool for researchers to assess and develop better Korean LLMs.
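For readers who want to try the benchmark, the following is a minimal sketch of loading one KMMLU subject with the Hugging Face `datasets` library. The repository id, subject config name, and split name are assumptions based on the Hub release; check the dataset card for the exact identifiers.

```python
# Minimal sketch: loading one KMMLU subject from the Hugging Face Hub.
# Assumptions (verify against the dataset card): repository id "HAERAE-HUB/KMMLU",
# subject config "Accounting", and a "test" split of multiple-choice items.
from datasets import load_dataset

kmmlu = load_dataset("HAERAE-HUB/KMMLU", "Accounting")
print(kmmlu["test"][0])  # a single question with its answer options and label
```

Within EleutherAI's Language Model Evaluation Harness, the benchmark is exposed as prebuilt tasks (exact task names vary by harness version), so models can be scored on KMMLU without writing custom evaluation code.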