What Makes Math Word Problems Challenging for LLMs?

2024 | KV Aditya Srivatsa, Ekaterina Kochmar
This paper investigates what makes math word problems (MWPs) in English challenging for large language models (LLMs). The authors analyze the linguistic and mathematical characteristics of MWPs and train feature-based classifiers to understand how each feature affects problem difficulty, with the goal of determining which features make MWPs challenging and whether those features can predict how well LLMs perform on specific categories of problems.

The study uses the GSM8K dataset, a diverse collection of grade-school math word problems, split into training and test instances, and evaluates several LLMs: Llama2 (13B and 70B), Mistral-7B, and MetaMath-13B. Features are extracted from both the questions and their reference solutions and grouped into three families: linguistic, mathematical, and world knowledge & NLU-based.

The analysis finds that questions involving a high number and diversity of math operations, or infrequent numerical tokens, are particularly challenging for LLMs. Lengthy questions with low readability scores, and those requiring real-world knowledge, are also seldom solved correctly. Overall, LLM success rates are shaped by the linguistic complexity of the questions, the number of steps and types of math operations involved, and the amount of real-world knowledge required to solve the tasks.

The authors then train classifiers to predict whether an LLM will solve a given MWP and find that Random Forest outperforms the other classifiers across most solution sets. Ablation studies measuring the impact of each feature type show that mathematical features matter most for Llama2-13B, while both linguistic and mathematical features are important for Mistral-7B and MetaMath-13B. Success rates are also influenced by question readability and by the presence of extraneous information.
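To make the three feature families concrete, here is a minimal, illustrative feature extractor. The specific features and names below are assumptions for illustration, not the paper's exact feature set: a length and sentence-length proxy for the linguistic family, operation counts over the reference solution for the mathematical family, and a count of numerals spelled out as words as a rough stand-in for the NLU-oriented features.

```python
import re

def extract_features(question: str, solution: str) -> dict:
    """Toy extractor sketching the paper's three feature families.
    Feature names and choices here are illustrative assumptions."""
    words = question.split()
    sentences = [s for s in re.split(r"[.!?]+", question) if s.strip()]

    # Linguistic: question length and a crude readability proxy
    linguistic = {
        "n_words": len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }

    # Mathematical: number and diversity of operations, and step count,
    # read off the reference solution
    ops = re.findall(r"[+\-*/]", solution)
    mathematical = {
        "n_operations": len(ops),
        "n_distinct_operations": len(set(ops)),
        "n_solution_steps": solution.count("\n") + 1,
    }

    # World knowledge / NLU proxy: numerals written out as words
    number_words = {"one", "two", "three", "four", "five",
                    "six", "seven", "eight", "nine", "ten"}
    world_knowledge = {
        "n_number_words": sum(w.lower().strip(",.?") in number_words
                              for w in words),
    }
    return {**linguistic, **mathematical, **world_knowledge}
```

A vector of such features per question is what the difficulty classifiers described below would consume.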
The study concludes that the difficulty of MWPs for LLMs stems from the linguistic complexity of the questions, the conceptual complexity of the tasks, the number of steps and types of math operations involved in the solution, and the amount of real-world knowledge required. The authors suggest that future work focus on modifying questions to better understand the impact on LLMs' reasoning and MWP-solving abilities.
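The paper's best difficulty predictor is a Random Forest, which in practice would come from a standard library. Purely as a self-contained illustration of the underlying idea (bootstrap sampling of rows plus random feature subsets, with a majority vote), here is a heavily simplified forest of one-level decision stumps; it is a sketch of the technique, not the paper's implementation.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label; defaults to 0 for an empty side of a split."""
    return Counter(labels).most_common(1)[0][0] if labels else 0

def fit_stump(X, y, feat_ids):
    """One-level decision tree: choose the (feature, threshold) pair whose
    split into (<= thr, > thr) halves misclassifies the fewest rows."""
    best_err, best = len(y) + 1, None
    for f in feat_ids:
        for thr in {row[f] for row in X}:
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = majority(left), majority(right)
            err = sum(t != lm for t in left) + sum(t != rm for t in right)
            if err < best_err:
                best_err, best = err, (f, thr, lm, rm)
    return best

def predict_stump(stump, row):
    f, thr, lm, rm = stump
    return lm if row[f] <= thr else rm

def fit_forest(X, y, n_trees=25, seed=0):
    """Bag stumps: each tree sees a bootstrap sample of the rows and a
    random subset of the features, as in a (much simplified) random forest."""
    rng = random.Random(seed)
    n_feats = len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]                  # bootstrap rows
        feats = rng.sample(range(n_feats), max(1, n_feats // 2))   # feature bag
        forest.append(fit_stump([X[i] for i in rows],
                                [y[i] for i in rows], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over the stumps."""
    return majority([predict_stump(s, row) for s in forest])
```

Fed with per-question feature vectors and solved/unsolved labels, `predict_forest` plays the role of the paper's difficulty classifier; a production version would use a library implementation with full-depth trees.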