TravelPlanner: A Benchmark for Real-World Planning with Language Agents

23 Jun 2024 | Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su
TravelPlanner is a benchmark for evaluating the planning capabilities of language agents in real-world scenarios, with a focus on travel planning. It provides a rich sandbox environment with access to nearly four million data records and includes 1,225 meticulously curated planning intents and reference plans. The benchmark evaluates how well language agents handle complex planning tasks under multiple constraints, such as budget, user needs, and commonsense rules. It also includes tools for information collection, along with evaluation metrics that assess whether agents follow the constraints and produce feasible plans. Comprehensive evaluations show that current language agents struggle with these tasks: even GPT-4 achieves a success rate of only 0.6%. The results indicate that current language agents cannot yet handle planning tasks of this complexity, although the fact that they can attempt such a problem at all is itself meaningful progress. The benchmark highlights the difficulty of planning in complex, real-world settings, the importance of tracking multiple constraints at once, and the need for agents to adapt to dynamic environments and use more sophisticated planning strategies. Overall, TravelPlanner offers a challenging yet meaningful testbed for advancing the planning abilities of future language agents.
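To make the idea of constraint-based evaluation concrete, here is a minimal sketch in Python of how a plan might be checked against a budget, a user need, and a commonsense rule, and how a success rate could be computed over many plans. This is an illustration only, not TravelPlanner's actual evaluation code: the `Plan` and `Intent` classes, their fields, and the three check functions are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical plan record; not the benchmark's real schema."""
    total_cost: float
    cities_by_day: list[str]
    cuisines: set[str] = field(default_factory=set)


@dataclass
class Intent:
    """Hypothetical user request carrying the constraints to satisfy."""
    budget: float
    days: int
    required_cuisines: set[str] = field(default_factory=set)


def within_budget(plan: Plan, intent: Intent) -> bool:
    # Budget constraint: total cost must not exceed the stated budget.
    return plan.total_cost <= intent.budget


def covers_all_days(plan: Plan, intent: Intent) -> bool:
    # Commonsense-style rule (assumed): every day of the trip is planned.
    return len(plan.cities_by_day) == intent.days


def meets_user_needs(plan: Plan, intent: Intent) -> bool:
    # User need (assumed): every requested cuisine appears in the plan.
    return intent.required_cuisines <= plan.cuisines


CHECKS = (within_budget, covers_all_days, meets_user_needs)


def passes(plan: Plan, intent: Intent) -> bool:
    # A plan counts as a success only if it satisfies every constraint.
    return all(check(plan, intent) for check in CHECKS)


def success_rate(pairs) -> float:
    # Fraction of (plan, intent) pairs for which all constraints hold.
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(passes(p, i) for p, i in pairs) / len(pairs)
```

Because a plan must satisfy all constraints simultaneously, a single violation (for example, going over budget by a small amount) makes the whole plan fail, which is one reason an all-or-nothing success rate can be very low even when individual constraints are often met.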