TravelPlanner: A Benchmark for Real-World Planning with Language Agents

23 Jun 2024 | Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su
**Authors:** Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

**Contact Information:** jianxie22@m.fudan.edu.cn, shawyh@fudan.edu.cn, {zhang.13253, su.809}@osu.edu

**Abstract:** Planning has been a core pursuit in artificial intelligence, but early AI agents primarily focused on constrained settings due to the lack of necessary cognitive substrates. Recently, language agents powered by large language models (LLMs) have shown capabilities in tool use and reasoning. To investigate whether these agents can handle more complex real-world planning tasks, we propose TravelPlanner, a benchmark focused on travel planning. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 curated planning intents and reference plans. Evaluations show that current language agents, even GPT-4, achieve only a 0.6% success rate in handling complex planning tasks. Agents struggle with staying on task, using the right tools, and keeping track of multiple constraints. However, the mere possibility of language agents tackling such complex problems is itself significant progress. TravelPlanner offers a challenging yet meaningful testbed for future language agents.

**Introduction:** Planning is a hallmark of human intelligence, involving tool use, information collection, and decision-making. AI agents have been developed to mimic human planning, but they often operate in constrained settings. LLMs have given rise to a new generation of language agents, capable of using language for thought and communication. Previous research has explored their capabilities in various planning tasks, but most settings remain conventional and single-objective. TravelPlanner focuses on travel planning, a complex and realistic scenario involving long-horizon decisions, multiple constraints, and proactive information acquisition. It provides a rich sandbox environment with nearly four million data entries and 1,225 diverse user queries.

**Evaluation:** We evaluate five LLMs and four planning strategies on TravelPlanner. State-of-the-art LLMs, including GPT-4, Gemini, and Mixtral, achieve only a 0.6% success rate. Existing planning strategies like ReAct and Reflexion struggle with multi-constraint tasks, often failing to convert reasoning into correct actions. Common failure modes include argument errors, dead loops, and hallucinations. Despite these challenges, the possibility of language agents tackling complex planning tasks is a significant step forward.

**Conclusion:** TravelPlanner is a challenging benchmark for evaluating the multi-constraint planning and tool-use abilities of language agents. Even advanced models achieve only a 0.6% success rate, highlighting the need for further research to enhance agents' performance in complex scenarios.
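
To illustrate why a success rate can be so low on multi-constraint tasks, the following is a minimal sketch, not the benchmark's actual evaluation code. It assumes that a plan counts as a success only when it satisfies every constraint at once. The plan fields (`total_cost`, `budget`, `days`, `daily_items`) and the two checker functions are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical plan representation: a flat dict of plan attributes.
Plan = Dict[str, Any]
Constraint = Callable[[Plan], bool]


def within_budget(plan: Plan) -> bool:
    """Hypothetical constraint: total cost must not exceed the stated budget."""
    return plan["total_cost"] <= plan["budget"]


def covers_all_days(plan: Plan) -> bool:
    """Hypothetical constraint: the plan must have one entry per trip day."""
    return len(plan["daily_items"]) == plan["days"]


CONSTRAINTS: List[Constraint] = [within_budget, covers_all_days]


def plan_succeeds(plan: Plan, constraints: List[Constraint]) -> bool:
    """Assumed success criterion: every constraint must hold simultaneously."""
    return all(check(plan) for check in constraints)


def success_rate(plans: List[Plan], constraints: List[Constraint]) -> float:
    """Fraction of plans that satisfy all constraints (0.0 for an empty list)."""
    if not plans:
        return 0.0
    return sum(plan_succeeds(p, constraints) for p in plans) / len(plans)


def per_constraint_rate(plans: List[Plan], check: Constraint) -> float:
    """Fraction of plans that satisfy one constraint, ignoring the others."""
    if not plans:
        return 0.0
    return sum(check(p) for p in plans) / len(plans)


# Toy data, invented for this example only.
plans: List[Plan] = [
    {"total_cost": 900, "budget": 1000, "days": 3, "daily_items": ["d1", "d2", "d3"]},
    {"total_cost": 1200, "budget": 1000, "days": 3, "daily_items": ["d1", "d2", "d3"]},
    {"total_cost": 800, "budget": 1000, "days": 3, "daily_items": ["d1", "d2"]},
    {"total_cost": 1500, "budget": 1000, "days": 3, "daily_items": ["d1"]},
]

if __name__ == "__main__":
    for check in CONSTRAINTS:
        print(f"{check.__name__}: {per_constraint_rate(plans, check):.1%}")
    print(f"all constraints: {success_rate(plans, CONSTRAINTS):.1%}")
```

In this toy data each constraint is satisfied by half of the plans, yet only one plan in four satisfies both. The gap between per-constraint and joint rates widens as more constraints are added, which is one way to read the difficulty agents have in keeping track of multiple constraints.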