4 Jul 2024 | Eunsu Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
The paper introduces CLiCK, a benchmark dataset for evaluating the cultural and linguistic intelligence of large language models (LLMs) in Korean. Despite advances in LLMs for Korean, comprehensive benchmark datasets that capture the language's unique cultural and linguistic contexts are lacking. CLiCK addresses this gap with 1,995 QA pairs sourced from official Korean exams and textbooks, organized into eleven subcategories under two main categories: language and culture. Each instance is annotated with fine-grained details of the cultural and linguistic knowledge required to answer it. The authors evaluate 13 LLMs on CLiCK, reporting their performance across categories and highlighting the limitations of current models. The results show that open-source models perform poorly, while proprietary models such as GPT-3.5 and Claude-2 outperform them but still struggle in certain areas. The study emphasizes the need for more tailored methods to enhance cultural intelligence in non-English languages. CLiCK is publicly available and aims to support further research on cultural and linguistic benchmarks.