15 Jun 2024 | Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun
This paper addresses the problem of evaluating reasoning strategies in large language models (LLMs) by introducing a budget-aware evaluation framework. Traditional evaluations often overlook computational cost, which can significantly affect reported performance, so the authors propose incorporating the compute budget directly into evaluation metrics to enable more comprehensive and fair comparisons. They find that complex reasoning strategies, such as multi-agent debate (MAD) and Reflexion, often fail to outperform simpler baselines like chain-of-thought self-consistency (CoT SC) once the larger computational resources they consume are taken into account; at comparable compute budgets, CoT SC frequently outperforms the more sophisticated strategies. The paper also examines how different budget types, such as answer-generation and evaluation budgets, influence performance. It shows that self-evaluation quality depends on both the model and the dataset, and that calibration, measured via correctness-prediction proxies, correlates strongly with the success of reasoning strategies that leverage self-evaluation. The study emphasizes efficient budget utilization as a key factor in the effectiveness of reasoning strategies. The contributions include the budget-aware evaluation framework, a comprehensive evaluation of seven LLM reasoning strategies across five datasets, and an in-depth analysis of the dynamics of specific strategies such as MAD and Reflexion. The findings suggest that the observed improvements of these methods stem primarily from increased budget allocations rather than from the intrinsic merits of the methodologies.
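To make the budget-matching idea concrete, here is a minimal Python sketch (not the authors' code) of how one might compare strategies on accuracy at a matched token budget: CoT self-consistency is run with increasing sample counts to trace an accuracy-versus-compute curve, and any other strategy can then be checked against that curve at the same average token cost. The names `sample_fn`, `StrategyResult`, `budget_aware_accuracy`, and the mock LLM call are all hypothetical, and "tokens used" is a simplified stand-in for whatever budget measure the framework actually uses.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StrategyResult:
    """Answer produced by a strategy plus the compute it consumed."""
    answer: str
    tokens_used: int


def cot_self_consistency(
    sample_fn: Callable[[str], Tuple[str, int]],  # returns (answer, tokens)
    question: str,
    num_samples: int,
) -> StrategyResult:
    """CoT SC: sample several reasoning chains, majority-vote the answers."""
    answers: List[str] = []
    total_tokens = 0
    for _ in range(num_samples):
        answer, tokens = sample_fn(question)
        answers.append(answer)
        total_tokens += tokens
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return StrategyResult(answer=majority_answer, tokens_used=total_tokens)


def budget_aware_accuracy(
    run_strategy: Callable[[str], StrategyResult],
    dataset: List[Tuple[str, str]],  # (question, gold answer) pairs
) -> Tuple[float, float]:
    """Report (accuracy, mean tokens per question) so strategies are
    compared at matched compute rather than on accuracy alone."""
    correct = 0
    tokens = 0
    for question, gold in dataset:
        result = run_strategy(question)
        correct += int(result.answer.strip() == gold.strip())
        tokens += result.tokens_used
    return correct / len(dataset), tokens / len(dataset)


if __name__ == "__main__":
    # Toy stand-in for an LLM call: answers "4" to "2+2" 70% of the time
    # and burns ~100 tokens per reasoning chain. Purely illustrative.
    def mock_sample(question: str) -> Tuple[str, int]:
        return ("4" if random.random() < 0.7 else "5"), 100

    dataset = [("2+2", "4")] * 50
    for n in (1, 3, 5, 9):
        acc, avg_tokens = budget_aware_accuracy(
            lambda q: cot_self_consistency(mock_sample, q, n), dataset
        )
        print(f"CoT SC with {n} samples: accuracy={acc:.2f}, "
              f"avg tokens={avg_tokens:.0f}")
```

In this framing, a strategy like MAD or Reflexion would only count as an improvement if it beat the CoT SC point with the same average token cost, which is the comparison the paper argues is usually missing.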