[slides] TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

TIMEARENA is a textual simulation environment designed to evaluate the multitasking efficiency of language agents by incorporating complex temporal dynamics and constraints. The environment includes 30 real-world tasks across cooking, household activities, and laboratory work. Agents must complete multiple tasks as quickly as possible; some actions occupy the agent, while others leave it free to process other actions in parallel. TIMEARENA considers three key factors: time duration and dependencies, agent occupancy, and object occupancy. The environment provides feedback on agent actions, and agents are evaluated on four metrics: average progress score, completion speed, task completion rate, and average completion time.

Experiments with various state-of-the-art LLMs show that even the most advanced models, including GPT-4, struggle with efficient multitasking, and many models fail to complete multiple tasks. Open-source models such as OpenChat-3.5 and Vicuna-13B perform better than GPT-3.5 but still have difficulty managing complex action dependencies. The study also explores the impact of resource constraints on multitasking and finds that GPT-4 rarely attempts parallel processing, which leads to lower completion rates. Agents also often wait unnecessarily, which indicates a lack of awareness that actions can be processed in parallel. The results underscore the limitations of current language agents on complex, time-sensitive tasks and the need for further research on temporal awareness.
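To make the three factors concrete, here is a minimal Python sketch of a time-stepped scheduler. It is an illustration only, not the TIMEARENA implementation: the `Action` fields, the `simulate` function, the greedy policy, the example task list, and the assumption that starting a non-occupying action costs the agent no time are all hypothetical. It compares an agent that waits after every action with one that uses non-occupying actions to work in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    duration: int          # time steps the action takes (hypothetical values)
    occupies_agent: bool   # True: the agent is busy until the action ends
    obj: str               # object that stays occupied until the action ends
    deps: tuple = ()       # names of actions that must finish first


def simulate(actions, allow_parallel):
    """Greedy time-stepped schedule; returns the total completion time."""
    finish = {}      # action name -> finish time of every started action
    obj_free = {}    # object name -> time at which it becomes free again
    agent_free = 0   # time at which the agent can start its next action
    t = 0
    pending = list(actions)
    while pending:
        started = False
        if agent_free <= t:
            for a in pending:
                deps_done = all(d in finish and finish[d] <= t for d in a.deps)
                obj_ok = obj_free.get(a.obj, 0) <= t
                if deps_done and obj_ok:
                    end = t + a.duration
                    finish[a.name] = end
                    obj_free[a.obj] = end
                    # The agent is blocked by occupying actions, or by every
                    # action when parallel processing is disabled (baseline).
                    blocked = a.occupies_agent or not allow_parallel
                    agent_free = end if blocked else t
                    pending.remove(a)
                    started = True
                    break
        if not started:
            # Nothing running and nothing startable: dependencies are unsatisfiable.
            if all(end <= t for end in finish.values()):
                raise ValueError("remaining actions can never start")
            t += 1
    return max(finish.values())


# Hypothetical task list, loosely in the spirit of the cooking scenario.
tasks = [
    Action("boil_water", 5, False, "kettle"),
    Action("wash_cup", 2, True, "cup"),
    Action("chop_vegetables", 3, True, "knife"),
    Action("make_tea", 1, True, "cup", deps=("boil_water", "wash_cup")),
]

parallel_time = simulate(tasks, allow_parallel=True)
sequential_time = simulate(tasks, allow_parallel=False)
print(parallel_time, sequential_time)  # 6 11
```

In this toy setup the waiting agent needs the sum of all durations, while the parallel agent overlaps the water boiling with washing and chopping. That gap is the kind of inefficiency the summary attributes to unnecessary waiting.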

TIMEARENA: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

8 Feb 2024 | Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson, Yanghua Xiao, Jiangjie Chen