OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

18 Jun 2024 | Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
This paper introduces *OlympicArena*, a comprehensive benchmark for evaluating the cognitive reasoning abilities of large language models (LLMs) and large multimodal models (LMMs) on a wide range of interdisciplinary problems. The benchmark comprises 11,163 bilingual problems spanning seven disciplines (mathematics, physics, chemistry, biology, geography, astronomy, and computer science), sourced from 62 international Olympic competitions. Problems are organized into 13 answer types and evaluated with both answer-level and process-level criteria.

*OlympicArena* is designed to be highly challenging and rigorous, with a focus on complex, interdisciplinary tasks. Its process-level evaluation assesses models' step-by-step reasoning, which is crucial for understanding the depth of cognitive reasoning beyond final-answer correctness. The benchmark also supports both text-only and interleaved text-image modalities, enhancing its applicability to real-world scenarios. Experiments with top-performing LMMs and LLMs show that even advanced models such as GPT-4o achieve only 39.97% overall accuracy, highlighting current limitations in handling complex, multidisciplinary problems.

Further analysis reveals that models struggle with decompositional reasoning, spatial and geometric perception, and abstract symbol interpretation, and that current models are not proficient at leveraging visual information in multimodal tasks. The paper details the benchmark's design, data collection, annotation process, and experimental setup, and releases a comprehensive set of resources: the benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission. *OlympicArena* aims to advance AI toward superintelligence by providing a rigorous framework for assessing and improving cognitive reasoning capabilities.
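To make the two evaluation granularities concrete, the minimal sketch below shows one way answer-level accuracy (did the final answer match the gold answer?) and a process-level score (what fraction of reasoning steps were judged correct?) could be computed for a batch of model predictions. The data classes, field names (`final_answer`, `step_judgments`, `gold_answer`), and the exact-match / step-fraction scoring are hypothetical illustrations, not the paper's actual evaluation tool.

```python
# Hypothetical sketch of answer-level vs. process-level scoring.
# Assumes a judge (human or model) has already marked each reasoning
# step as correct/incorrect; OlympicArena's real tool may differ.

from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    problem_id: str
    gold_answer: str


@dataclass
class Prediction:
    final_answer: str           # model's extracted final answer
    step_judgments: List[bool]  # per-step correctness judgments


def answer_level_accuracy(problems: List[Problem], preds: List[Prediction]) -> float:
    """Fraction of problems whose final answer matches the gold answer."""
    correct = sum(
        p.final_answer.strip().lower() == q.gold_answer.strip().lower()
        for q, p in zip(problems, preds)
    )
    return correct / len(problems)


def process_level_score(preds: List[Prediction]) -> float:
    """Mean fraction of reasoning steps judged correct, averaged over problems."""
    per_problem = [
        sum(p.step_judgments) / len(p.step_judgments)
        for p in preds
        if p.step_judgments
    ]
    return sum(per_problem) / len(per_problem) if per_problem else 0.0


if __name__ == "__main__":
    problems = [Problem("phys-001", "42 J"), Problem("math-002", "x = 3")]
    preds = [
        Prediction("42 J", [True, True, False]),
        Prediction("x = 2", [True, False, False]),
    ]
    print(f"answer-level accuracy: {answer_level_accuracy(problems, preds):.2f}")
    print(f"process-level score:   {process_level_score(preds):.2f}")
```

The point of separating the two scores, as the paper argues, is that a model can reach a correct final answer through flawed reasoning (or fail despite mostly sound steps); reporting both surfaces that gap.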