CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

1 Jul 2024 | Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
CRAB is a novel benchmark framework for evaluating multimodal language model (MLM) agents in cross-environment scenarios. It introduces a graph-based evaluation method and an efficient task construction mechanism, enabling agents to perform tasks across multiple devices and platforms. The framework exposes Python interfaces and can be extended to any environment. Crab Benchmark-v0 comprises 100 tasks spanning desktop and mobile environments and is used to evaluate four advanced MLMs under various agent configurations; the single-agent configuration with GPT-4o achieves the highest completion ratio, at 35.26%.

Crab's graph evaluator decomposes each task into sub-goals and assesses intermediate steps, enabling fine-grained evaluation while accommodating multiple valid pathways to task completion. The framework also includes a sub-task composition method for efficient task construction. The benchmark covers both cross-environment and single-environment tasks drawn from a wide range of real-world applications. Evaluation metrics include completion ratio, execution efficiency, and cost efficiency, offering a more accurate assessment of agent performance. All code and datasets are publicly available on GitHub. The results underscore the need for continued development of more effective autonomous agents.
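To make the graph-based evaluation idea concrete, here is a minimal Python sketch. The names (SubGoal, completion_ratio) and the toy task are illustrative assumptions, not the actual Crab API: a task is modeled as a DAG of sub-goal checkers, and the completion ratio is the fraction of sub-goals whose checker passes once all of its predecessors have passed.

```python
# Hypothetical sketch of graph-based task evaluation (illustrative only).
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubGoal:
    name: str
    check: Callable[[dict], bool]               # inspects environment state
    predecessors: List[str] = field(default_factory=list)


def completion_ratio(subgoals: Dict[str, SubGoal], env_state: dict) -> float:
    """Return the fraction of sub-goals completed, respecting dependencies."""
    completed: set = set()
    changed = True
    while changed:                               # propagate until no new sub-goal completes
        changed = False
        for g in subgoals.values():
            if g.name in completed:
                continue
            if all(p in completed for p in g.predecessors) and g.check(env_state):
                completed.add(g.name)
                changed = True
    return len(completed) / len(subgoals) if subgoals else 0.0


if __name__ == "__main__":
    # Toy cross-environment task: copy a note on the phone, paste it into a desktop document.
    goals = {
        "open_note":  SubGoal("open_note",  lambda s: s.get("note_open", False)),
        "copy_text":  SubGoal("copy_text",  lambda s: s.get("clipboard") == "hello",
                              predecessors=["open_note"]),
        "paste_text": SubGoal("paste_text", lambda s: "hello" in s.get("desktop_doc", ""),
                              predecessors=["copy_text"]),
    }
    state = {"note_open": True, "clipboard": "hello", "desktop_doc": ""}
    print(completion_ratio(goals, state))        # 2 of 3 sub-goals met -> ~0.67
```

Because progress is measured per sub-goal rather than as a single pass/fail outcome, an agent that completes part of a task still receives partial credit, and any valid ordering of independent sub-goals is scored the same.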