MERA: A Comprehensive LLM Evaluation in Russian


2 Aug 2024 | Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytsheva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Anastasia Minaeva, Denis Dimitrov, Alexander Panchenko, Sergei Markov
The paper introduces MERA, a comprehensive benchmark for evaluating foundation models (FMs) and large language models (LLMs) in Russian. MERA consists of 21 evaluation tasks covering 10 skills and is designed to assess a wide range of abilities, including natural language understanding, expert knowledge, coding skills, and ethical biases. The benchmark evaluates models in zero- and few-shot instruction settings, ensuring reproducibility and fairness, and it comes with an open-source code base, a public leaderboard, and a submission system. The authors evaluate open-source LMs as baselines and find that they still fall short of human performance. MERA aims to guide future research, standardize evaluation procedures, and address ethical concerns. The paper also discusses the limitations of current benchmarks and the need for more challenging and diverse evaluation tools.
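To make the zero- and few-shot instruction setting concrete, below is a minimal sketch of how such an evaluation is typically run: answer options for a multiple-choice item are scored by their log-likelihood under a causal LM, with optional few-shot demonstrations prepended to the instruction. This is an illustration under stated assumptions, not the MERA harness itself; the model name, prompt template, and in-line examples are placeholders.

```python
# Minimal sketch (not the MERA codebase): zero-/few-shot instruction evaluation
# of a causal LM on a multiple-choice task, scored by accuracy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "ai-forever/rugpt3small_based_on_gpt2"  # assumption: any HF causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Illustrative items: an instruction, a question, candidate answers, and the gold index.
dataset = [
    {"instruction": "Choose the correct answer.",
     "question": "What is the capital of Russia?",
     "options": ["Moscow", "Kazan"],
     "answer": 0},
]

few_shot_examples = [  # pass an empty list for the zero-shot setting
    {"question": "2 + 2 = ?", "options": ["4", "5"], "answer": 0},
]

def build_prompt(item, shots):
    """Compose an instruction prompt with optional few-shot demonstrations."""
    parts = [item["instruction"]]
    for shot in shots:
        parts.append(f"Question: {shot['question']}\nAnswer: {shot['options'][shot['answer']]}")
    parts.append(f"Question: {item['question']}\nAnswer:")
    return "\n\n".join(parts)

@torch.no_grad()
def option_logprob(prompt, option):
    """Sum of token log-probabilities of `option` as a continuation of `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]      # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    start = prompt_ids.shape[1] - 1              # first continuation position
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

correct = 0
for item in dataset:
    prompt = build_prompt(item, few_shot_examples)
    scores = [option_logprob(prompt, opt) for opt in item["options"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == item["answer"])

print(f"accuracy: {correct / len(dataset):.2f}")
```

In practice, a harness like MERA's fixes the instruction prompts and few-shot demonstrations per task so that every submitted model is scored under identical conditions, which is what makes the leaderboard comparisons reproducible and fair.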