NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

6 Jun 2024 | Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang*, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou
NATURAL PLAN is a new benchmark for evaluating the planning capabilities of large language models (LLMs) in natural language. It comprises three tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. Each task supplies the model with full context drawn from real-world tools such as Google Flights, Google Maps, and Google Calendar, eliminating the need for a tool-use environment: the model must plan trips, schedule meetings, and arrange calendar events under various constraints using only the information given in the prompt.

The benchmark is challenging for state-of-the-art models. In Trip Planning, GPT-4 and Gemini 1.5 Pro achieve only 31.1% and 34.8% solve rates, respectively, while GPT-3.5 and GPT-4o perform much worse. Performance also drops sharply as task complexity grows: when a trip involves 10 cities, every model solves fewer than 5% of instances. The benchmark additionally includes extensive ablation studies of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts.

The dataset is constructed synthetically by combining real-world data with added constraints to create realistic planning scenarios. Trip Planning asks for an itinerary that visits a set of cities under constraints such as required stay lengths; Meeting Planning involves scheduling meetings with multiple people; Calendar Scheduling involves arranging a meeting around participants' existing schedules and constraints.
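To make the task format concrete, here is a minimal sketch of a constraint checker for a Trip Planning instance. The `Visit` structure, the field names, and the convention that a flight day counts toward both cities are illustrative assumptions about the setup described above, not the benchmark's actual schema; the benchmark itself scores a model's output against a reference plan rather than running a checker like this.

```python
# Hypothetical Trip Planning validator. The data format below is an
# illustrative assumption, not NATURAL PLAN's actual schema.
from dataclasses import dataclass

@dataclass
class Visit:
    city: str
    start_day: int  # inclusive; the flight day is shared with the previous city
    end_day: int    # inclusive

def is_valid_plan(plan, required_days, direct_flights, events):
    """Check a candidate itinerary against the instance constraints.

    required_days: {city: number of days to spend there}
    direct_flights: set of frozenset({a, b}) city pairs with direct flights
    events: list of (city, day) constraints, e.g. a meeting fixed on that day
    """
    # Every required city appears exactly once, with the right stay length.
    if sorted(v.city for v in plan) != sorted(required_days):
        return False
    for v in plan:
        if v.end_day - v.start_day + 1 != required_days[v.city]:
            return False

    # Consecutive visits must share the flight day and have a direct flight.
    for prev, nxt in zip(plan, plan[1:]):
        if nxt.start_day != prev.end_day:
            return False
        if frozenset({prev.city, nxt.city}) not in direct_flights:
            return False

    # Fixed events must fall inside the visit to the matching city.
    for city, day in events:
        if not any(v.city == city and v.start_day <= day <= v.end_day
                   for v in plan):
            return False
    return True

# Toy instance: 3 cities over 7 days, flight days counted for both cities.
plan = [Visit("Paris", 1, 3), Visit("Rome", 3, 5), Visit("Vienna", 5, 7)]
print(is_valid_plan(
    plan,
    required_days={"Paris": 3, "Rome": 3, "Vienna": 3},
    direct_flights={frozenset({"Paris", "Rome"}), frozenset({"Rome", "Vienna"})},
    events=[("Rome", 4)],
))  # True
```

Even for such small instances, the number of feasible orderings and day assignments grows combinatorially with the number of cities, which is consistent with the sharp accuracy drop the benchmark observes at 10 cities.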
Experiments show that Gemini 1.5 Pro outperforms the other models on Trip Planning and Calendar Scheduling. In-context planning with long contexts improves performance substantially, with Gemini 1.5 Pro reaching up to 39.9% accuracy on Trip Planning and 48.9% on Calendar Scheduling. Self-correction, by contrast, does not improve performance, and all models struggle as tasks grow more complex.
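The in-context planning result amounts to many-shot prompting: packing a large number of solved instances into a long-context window ahead of the new task. The sketch below shows one way such a prompt could be assembled; the exemplar format and the `build_many_shot_prompt` helper are hypothetical conveniences, not the paper's actual prompting code.

```python
# Illustrative many-shot prompt assembly for in-context planning.
# The "TASK:/SOLUTION:" exemplar format is an assumption.
def build_many_shot_prompt(solved_examples, new_task, k=100):
    """Concatenate k solved (task, plan) pairs ahead of the new task,
    relying on a long-context model to generalize from them."""
    shots = [f"TASK: {task}\nSOLUTION: {plan}"
             for task, plan in solved_examples[:k]]
    shots.append(f"TASK: {new_task}\nSOLUTION:")
    return "\n\n".join(shots)

examples = [
    ("Plan a 7-day trip to Paris, Rome and Vienna ...", "Day 1-3: Paris ..."),
    # ... many more solved instances to exploit a long context window
]
prompt = build_many_shot_prompt(examples, "Plan a 10-day trip to ...", k=2)
print(prompt)
```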
NATURAL PLAN highlights how difficult planning in natural language remains for LLMs, even when all tool-use information is supplied in context. It provides a realistic evaluation of LLMs' planning capabilities and identifies clear areas for improvement: while current models can handle basic planning tasks, they still struggle with complex, real-world scenarios.