6 Jun 2024 | Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang*, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou
The paper introduces NATURAL PLAN, a benchmark designed to evaluate the planning capabilities of large language models (LLMs) in natural language. NATURAL PLAN includes three key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. The evaluation focuses on the planning capabilities of LLMs with full information, providing the outputs of tools such as Google Flights, Google Maps, and Google Calendar as context. The benchmark is challenging for state-of-the-art models: GPT-4 and Gemini 1.5 Pro achieve only 31.1% and 34.8% solve rates on Trip Planning, respectively. Performance drops sharply as problem complexity increases, with all models falling below 5% when the itinerary involves 10 cities. The paper also conducts extensive ablation studies on NATURAL PLAN to assess approaches such as self-correction, few-shot generalization, and in-context planning with long contexts. The results show that while models struggle with complex tasks, in-context planning with long contexts can significantly improve performance, with Gemini 1.5 Pro outperforming the other models.
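To make the "solve rate" notion concrete, below is a minimal sketch of how one might check a model's Trip Planning-style output against an instance's constraints and score a batch of predictions. The data format, field names (`required_days`, `direct_flights`), and the simplified rules (each city visited once for a fixed number of days, moves allowed only along direct flights) are assumptions for illustration, not the benchmark's actual schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Visit:
    city: str
    days: int  # days the plan assigns to this city

def plan_is_valid(plan: list[Visit],
                  required_days: dict[str, int],
                  direct_flights: set[tuple[str, str]]) -> bool:
    """Hypothetical validity check for a simplified Trip Planning instance."""
    # Every required city must appear exactly once with the required duration.
    if {v.city: v.days for v in plan} != required_days:
        return False
    # Consecutive cities must be connected by a direct flight (either direction).
    for a, b in zip(plan, plan[1:]):
        if (a.city, b.city) not in direct_flights and (b.city, a.city) not in direct_flights:
            return False
    return True

def solve_rate(predicted_plans, instances) -> float:
    """Fraction of instances whose predicted plan satisfies all constraints."""
    solved = sum(
        plan_is_valid(plan, inst["required_days"], inst["direct_flights"])
        for plan, inst in zip(predicted_plans, instances)
    )
    return solved / len(instances)

# Toy 3-city example (illustrative values only).
instance = {
    "required_days": {"Paris": 3, "Rome": 2, "Madrid": 2},
    "direct_flights": {("Paris", "Rome"), ("Rome", "Madrid")},
}
prediction = [Visit("Paris", 3), Visit("Rome", 2), Visit("Madrid", 2)]
print(solve_rate([prediction], [instance]))  # 1.0 for this satisfying plan
```

An all-or-nothing check like this explains why scores collapse as the number of cities grows: a single mis-ordered leg or miscounted day invalidates the entire plan, so longer itineraries leave far more ways to fail.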