m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks


22 Sep 2024 | Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna
The paper introduces m&m's, a benchmark for evaluating the tool-use capabilities of large language models (LLMs) on multi-step, multi-modal tasks. The benchmark contains over 4,000 tasks involving 33 tools, including multi-modal models, public APIs, and image processing modules. Each task query comes with automatically generated plans, and a subset of 1,565 plans is human-verified and executable. The study evaluates 10 popular LLMs across two planning strategies (multi-step vs. step-by-step), two plan formats (JSON vs. code), and three types of feedback (parsing, verification, and execution). Key findings include:

1. Multi-step planning consistently outperforms step-by-step planning, with larger models showing a smaller gap.
2. Verification and execution feedback improve the ability to generate executable plans and to predict argument names, but can harm tool selection.
3. JSON-format generation produces more executable plans than code generation, despite similar performance in tool selection.

The paper also discusses the limitations of the benchmark and offers practical recommendations for designing planners for multi-step, multi-modal tasks. The dataset and evaluation code are available on Hugging Face and GitHub.
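To make the JSON vs. code plan-format distinction concrete, the sketch below shows a hypothetical two-step plan expressed both ways. The tool names, argument names, and JSON schema are illustrative assumptions, not the benchmark's exact specification.

```python
import json

# Illustrative two-step plan for a query like
# "Describe this photo and condense the description into one sentence."
# Tool names and the schema below are hypothetical stand-ins.

# JSON-format plan: an ordered list of tool calls with named arguments,
# where later steps reference earlier outputs symbolically.
json_plan = [
    {"id": 0, "name": "image_captioning", "args": {"image": "photo.jpg"}},
    {"id": 1, "name": "text_summarization", "args": {"text": "<node-0>.text"}},
]
print(json.dumps(json_plan, indent=2))

# Code-format plan: the same two steps written as executable calls,
# with each step consuming the previous step's return value directly.
def code_plan(image_captioning, text_summarization, image_path="photo.jpg"):
    caption = image_captioning(image=image_path)
    summary = text_summarization(text=caption)
    return summary
```

A planner that emits the JSON form must be parsed and validated before execution, whereas the code form is executed directly; the paper's finding is that the JSON route more often yields executable plans even though tool-selection accuracy is similar.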