OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

18 Jun 2024 | Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
This paper introduces *OlympicArena*, a comprehensive benchmark for evaluating the cognitive reasoning abilities of large language models (LLMs) and large multimodal models (LMMs) on a wide range of interdisciplinary problems. The benchmark comprises 11,163 bilingual problems spanning seven disciplines (mathematics, physics, chemistry, biology, geography, astronomy, and computer science), sourced from 62 international Olympic competitions. Problems are organized into 13 answer types and evaluated with both answer-level and process-level criteria.

*OlympicArena* is designed to be highly challenging and rigorous, with a focus on complex, interdisciplinary tasks. Its process-level evaluation assesses models' step-by-step reasoning, which is crucial for understanding the depth of cognitive reasoning beyond final-answer correctness. The benchmark also supports both text-only and interleaved text-image modalities, enhancing its applicability to real-world scenarios. Experiments with top-performing LMMs and LLMs show that even advanced models such as GPT-4o achieve only 39.97% overall accuracy, highlighting current limitations in handling complex, multidisciplinary problems.

Further analysis reveals that models struggle with decompositional reasoning, spatial and geometric perception, and abstract symbol interpretation, and that current models are not proficient at leveraging visual information in multimodal tasks. The paper details the benchmark's design, data collection, annotation process, and experimental setup, and releases a comprehensive set of resources: the benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission. *OlympicArena* aims to advance AI toward superintelligence by providing a rigorous framework for assessing and improving cognitive reasoning capabilities.
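To make the two evaluation granularities concrete, the minimal sketch below shows one way answer-level accuracy (did the final answer match the gold answer?) and a process-level score (what fraction of reasoning steps were judged correct?) could be computed for a batch of model predictions. The data classes, field names (`final_answer`, `step_judgments`, `gold_answer`), and the exact-match / step-fraction scoring are hypothetical illustrations, not the paper's actual evaluation tool.

```python
# Hypothetical sketch of answer-level vs. process-level scoring.
# Assumes a judge (human or model) has already marked each reasoning
# step as correct/incorrect; OlympicArena's real tool may differ.

from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    problem_id: str
    gold_answer: str


@dataclass
class Prediction:
    final_answer: str           # model's extracted final answer
    step_judgments: List[bool]  # per-step correctness judgments


def answer_level_accuracy(problems: List[Problem], preds: List[Prediction]) -> float:
    """Fraction of problems whose final answer matches the gold answer."""
    correct = sum(
        p.final_answer.strip().lower() == q.gold_answer.strip().lower()
        for q, p in zip(problems, preds)
    )
    return correct / len(problems)


def process_level_score(preds: List[Prediction]) -> float:
    """Mean fraction of reasoning steps judged correct, averaged over problems."""
    per_problem = [
        sum(p.step_judgments) / len(p.step_judgments)
        for p in preds
        if p.step_judgments
    ]
    return sum(per_problem) / len(per_problem) if per_problem else 0.0


if __name__ == "__main__":
    problems = [Problem("phys-001", "42 J"), Problem("math-002", "x = 3")]
    preds = [
        Prediction("42 J", [True, True, False]),
        Prediction("x = 2", [True, False, False]),
    ]
    print(f"answer-level accuracy: {answer_level_accuracy(problems, preds):.2f}")
    print(f"process-level score:   {process_level_score(preds):.2f}")
```

The point of separating the two scores, as the paper argues, is that a model can reach a correct final answer through flawed reasoning (or fail despite mostly sound steps); reporting both surfaces that gap.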