CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean


4 Jul 2024 | Eun-su Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
CLIcK is a benchmark dataset of cultural and linguistic intelligence in Korean containing 1,995 QA pairs. The data is sourced from official Korean exams and textbooks and is organized into two main categories, Korean Culture and Korean Language, with 11 subcategories. Each question is annotated with the specific cultural or linguistic knowledge required to answer it correctly. The dataset is publicly available at https://github.com/rladmstn1714/CLIcK.

CLIcK addresses the lack of benchmark datasets that test Korean cultural and linguistic knowledge, which is often overlooked in existing datasets derived from English counterparts. It provides a comprehensive analysis of LLMs' proficiency in Korean culture and language, covering evaluations of 13 language models. The results show that open-source models perform poorly, with accuracy ranging from 10% to 50%, while proprietary models such as GPT-3.5 and Claude-2 outperform them but still struggle in some categories. GPT-3.5 scores in the lowest 11th percentile on Korean tests, compared to the top 13th percentile on the English SAT. By providing fine-grained evaluations of LLMs across eleven diverse topics, CLIcK contributes to further research on assessing Korean cultural and linguistic knowledge within LLMs.

The dataset is constructed in three stages: data collection, data validation, and data categorization. Data is collected from six Korean examinations and one textbook and validated by four native Korean speakers. The questions are then categorized into Korean Culture and Korean Language, with eight subcategories for cultural intelligence and three for linguistic intelligence.
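As an illustration of how such a categorized QA dataset might be loaded and inspected, the sketch below assumes the repository ships the 1,995 QA pairs as JSON records with `question`, `choices`, `answer`, and `category` fields; the actual file layout and field names in the CLIcK repository may differ.

```python
import json
from collections import Counter
from pathlib import Path

def load_click(data_dir: str) -> list[dict]:
    """Load every JSON file under data_dir into a flat list of QA records."""
    records: list[dict] = []
    for path in sorted(Path(data_dir).glob("**/*.json")):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
            # Files may hold either a single record or a list of records.
            records.extend(payload if isinstance(payload, list) else [payload])
    return records

if __name__ == "__main__":
    data = load_click("CLIcK/data")                  # path is an assumption
    print(f"Loaded {len(data)} QA pairs")            # expected total: 1,995
    print(Counter(r.get("category") for r in data))  # 11 subcategories across Culture / Language
```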
CLIcK is evaluated on a range of open-source and proprietary LLMs. The evaluation methodology follows MMLU, and the results show that model scale does not significantly affect accuracy. A qualitative analysis of model performance further reveals that models struggle with cultural and linguistic intelligence tasks. These findings highlight the need for more tailored methods in further research on cultural and linguistic benchmarks.
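Because the summary states that evaluation follows MMLU, a likely setup is MMLU-style multiple-choice scoring: each answer option is scored by the model's log-likelihood given the question prompt, and accuracy is the fraction of items where the top-scoring option matches the key. The sketch below illustrates that idea; the model name, prompt template, and the assumption that the answer key is an option index are placeholders rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper evaluates 13 open-source and proprietary LLMs.
MODEL_NAME = "EleutherAI/polyglot-ko-1.3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

LETTERS = ["A", "B", "C", "D"]

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Probability of each option token, predicted from the preceding position.
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the highest-likelihood answer option."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)
    ) + "\nAnswer:"
    scores = [option_logprob(prompt, f" {LETTERS[i]}") for i in range(len(choices))]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(dataset: list[dict]) -> float:
    """Fraction of questions whose predicted option matches the answer key."""
    correct = sum(
        predict(ex["question"], ex["choices"]) == ex["answer"] for ex in dataset
    )
    return correct / len(dataset)
```

Scoring the option letters by log-likelihood rather than by free-form generation keeps the comparison across models deterministic and avoids having to parse generated text for an answer.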