2 Aug 2024 | Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Anastasia Minaeva, Denis Dimitrov, Alexander Panchenko, Sergei Markov
MERA is a comprehensive benchmark for evaluating large language models (LLMs) and foundation models (FMs) in Russian. It comprises 21 tasks covering 10 skills and assesses models in a fixed zero- and few-shot instruction setting, with private answer scoring to prevent data leakage. MERA ships with an open-source codebase, a public leaderboard, and a submission system, and aims to standardize evaluation procedures, address ethical concerns, and guide future research. The tasks span problem-solving, exam-based, and diagnostic (ethics) evaluations, in formats including classification, multiple-choice, and free-form answers; the evaluation methodology applies log-likelihood scoring or greedy generation depending on the task, and the design is flexible enough to be extended to other modalities such as images and audio. Open LMs evaluated as baselines remain far behind human performance: measured against human and random baselines, most models score near-randomly on the majority of tasks, with only a few showing stronger results on specific tasks. The results also underline the need for greater attention to ethical considerations when evaluating LLMs for Russian. MERA is released under the MIT license to encourage community contributions and collaboration, and aims to foster a reliable, standardized evaluation of LLMs and FMs in Russian, promoting the development of more robust and reliable models.
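To make the two scoring strategies mentioned above concrete, below is a minimal sketch (not the official MERA codebase) of how log-likelihood ranking for multiple-choice tasks and greedy generation for free-form tasks are typically implemented for a causal LM. The model checkpoint, prompt, and answer options are placeholders chosen for illustration, and the continuation-length bookkeeping is simplified (it assumes the prompt tokenizes identically with and without the appended option).

```python
# Sketch of log-likelihood vs. greedy-generation scoring, assuming a
# HuggingFace causal LM; not MERA's actual evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/rugpt3small_based_on_gpt2"  # placeholder baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_loglikelihood(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    # Position t predicts token t+1, so drop the last logit and align with targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    cont_ids = full_ids[:, -cont_len:]
    token_lp = log_probs[:, -cont_len:].gather(-1, cont_ids.unsqueeze(-1))
    return token_lp.sum().item()

# Log-likelihood strategy (classification / multiple-choice tasks):
# pick the answer option the model scores highest.
prompt = "Вопрос: Какой город является столицей России? Ответ: "  # illustrative item
options = ["Москва", "Париж", "Берлин"]
best_option = max(options, key=lambda o: option_loglikelihood(prompt, o))

# Greedy-generation strategy (free-form answer tasks):
# decode deterministically and compare the string against the reference.
gen_ids = model.generate(
    **tokenizer(prompt, return_tensors="pt"),
    max_new_tokens=16,
    do_sample=False,
)
generated_answer = tokenizer.decode(gen_ids[0], skip_special_tokens=True)
```

In a fixed few-shot setting, the same machinery applies after a fixed set of demonstration examples is prepended to the prompt, so differences between submissions come from the model rather than from prompt engineering.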