The paper introduces CHAMP (Concept and Hint-Annotated Math Problems), a dataset designed to evaluate the mathematical reasoning capabilities of large language models (LLMs). CHAMP consists of 270 high school-level math competition problems, each annotated with relevant concepts and hints. The dataset aims to explore how LLMs utilize additional information, such as problem-specific hints, to solve complex problems. The authors evaluate ten models, including GPT-3.5/4/4 Turbo, PaLM 2 Medium, Llama 2 7B/70B, Llama 3 8B/70B, Mistral 7B, and Mistral 8x22B, using 17 different prompts. The results show that while the best model scores only 58.1% in standard settings, performance improves with concepts and hints, indicating that some models can use such information. The authors also analyze the correctness of model-generated solutions, finding that many models arrive at correct answers through incorrect reasoning steps. Additionally, the models struggle with verifying solutions, highlighting the need for more fine-grained evaluations. The paper concludes by discussing the strengths and limitations of current models and suggesting future directions for improvement.