1 Jul 2024 | Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
The paper introduces *Crab*, a novel benchmark framework designed to evaluate Multimodal Language Models (MLMs) on cross-environment tasks. *Crab* addresses the limitations of existing benchmarks by incorporating a graph-based, fine-grained evaluation method and an efficient mechanism for constructing tasks and their evaluators. The framework supports multiple devices and can be extended to any environment with a Python interface. Leveraging *Crab*, the authors developed *Crab Benchmark-v0*, which includes 100 tasks across computer desktop and mobile phone environments. Four advanced MLMs were evaluated under various single-agent and multi-agent system configurations; the single agent backed by GPT-4o achieved the best completion ratio, 35.26%. The framework code, agent code, and task datasets are publicly available at <https://github.com/camel-ai/crab>.
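The graph-based, fine-grained evaluation mentioned above can be pictured as a DAG of sub-goal checkpoints, where the completion ratio is the fraction of checkpoints an agent has satisfied rather than a binary pass/fail on the whole task. Below is a minimal Python sketch of that idea; the `Checkpoint` class, the predicate signature, and the example task are hypothetical illustrations, not Crab's actual API.

```python
# Minimal sketch of graph-based, fine-grained task evaluation
# (illustrative only; names and structure are hypothetical, not Crab's API).
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    """One sub-goal of a task, verified by a predicate over environment state."""
    name: str
    check: Callable[[dict], bool]  # returns True once the sub-goal holds
    predecessors: List[str] = field(default_factory=list)  # DAG edges


def completion_ratio(checkpoints: Dict[str, Checkpoint], state: dict) -> float:
    """Fraction of checkpoints satisfied, counting a node only once all of its
    predecessors are satisfied, so partial progress is credited in order."""
    satisfied: set = set()
    changed = True
    while changed:  # propagate until no new checkpoint unlocks
        changed = False
        for cp in checkpoints.values():
            if (cp.name not in satisfied
                    and all(p in satisfied for p in cp.predecessors)
                    and cp.check(state)):
                satisfied.add(cp.name)
                changed = True
    return len(satisfied) / len(checkpoints)


# Hypothetical cross-environment task: download a file on the desktop,
# transfer it to the phone, then open it there.
task = {
    "downloaded": Checkpoint("downloaded",
                             lambda s: s.get("desktop_file_exists", False)),
    "transferred": Checkpoint("transferred",
                              lambda s: s.get("phone_file_exists", False),
                              predecessors=["downloaded"]),
    "opened": Checkpoint("opened",
                         lambda s: s.get("phone_viewer_open", False),
                         predecessors=["transferred"]),
}
state = {"desktop_file_exists": True, "phone_file_exists": True}
print(f"completion ratio: {completion_ratio(task, state):.2%}")  # 66.67%
```

Under this kind of metric, an agent that downloads and transfers the file but never opens it still earns a 2/3 completion ratio, which is what makes the evaluation fine-grained compared with a single end-state check.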