CHAMP is a competition-level dataset for fine-grained analyses of the mathematical reasoning capabilities of large language models (LLMs). The dataset consists of 270 high school math competition problems, annotated with concepts, hints, and model-generated solutions. These annotations allow for exploring how LLMs use additional information, such as concepts and hints, to solve problems. The dataset is challenging: the best model scores only 58.1% in the standard setting. Performance sometimes improves when concepts and hints are provided, indicating that some models can make use of such information. However, models often arrive at the correct final answer through wrong reasoning steps, and most struggle to verify their own solutions.
The dataset also includes annotations for the first wrong step in the reasoning process of model-generated solutions, enabling fine-grained evaluation of LLMs' solution verification abilities. It is used to evaluate ten models, including GPT-3.5, GPT-4, PaLM 2 Medium, Llama 2, Llama 3, Mistral 7B, and Mixtral 8x22B. The results show that while some models perform well on final-answer accuracy, they often fail to correctly verify their own solutions. These findings highlight the need for more detailed and multi-faceted evaluations of LLMs' mathematical reasoning, indicate that current models have substantial room for improvement on competition-level math problems, and establish CHAMP as a valuable resource for benchmarking and developing future models.
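As an illustration only, the minimal sketch below shows one way such annotated entries and a first-wrong-step verification check could be represented in code; the field names, sample problem, and helper function are hypothetical and do not reflect the dataset's actual schema or evaluation harness.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnnotatedProblem:
    """One hypothetical CHAMP-style record (illustrative fields, not the real schema)."""
    problem: str                                              # competition problem statement
    concepts: list[str] = field(default_factory=list)         # associated concepts
    hints: list[str] = field(default_factory=list)            # problem-specific hints
    solution_steps: list[str] = field(default_factory=list)   # model-generated solution steps
    first_wrong_step: int | None = None                       # index of first incorrect step; None if fully correct


def verification_correct(gold: AnnotatedProblem, predicted_first_wrong: int | None) -> bool:
    """Check whether a model's predicted first wrong step matches the annotation."""
    return gold.first_wrong_step == predicted_first_wrong


example = AnnotatedProblem(
    problem="Find all positive integers n such that n + 1 divides n^2 + 1.",
    concepts=["divisibility", "polynomial remainder"],
    hints=["Write n^2 + 1 = (n + 1)(n - 1) + 2."],
    solution_steps=[
        "n^2 + 1 = (n + 1)(n - 1) + 2",
        "So n + 1 must divide 2",
        "Hence n + 1 = 2, giving n = 1",
    ],
    first_wrong_step=None,  # this particular solution is fully correct
)

print(verification_correct(example, predicted_first_wrong=None))  # True
```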