The paper introduces m&m's, a benchmark for evaluating tool use in multi-step, multi-modal tasks. It contains 4,000+ tasks involving 33 tools, including multi-modal models, public APIs, and image processing modules, along with 1,565 human-verified, executable plans. The authors evaluate 10 LLMs across two planning strategies (multi-step vs. step-by-step), two plan formats (JSON vs. code), and three types of feedback (parsing, verification, execution). Results show that multi-step planning with JSON output and feedback yields the best tool-use performance. Verification and execution feedback improve plan executability and argument-name prediction but can slightly reduce tool selection accuracy. The JSON format generally produces more executable plans than code, though LLMs specialized for code fare better with the code format.
The benchmark supports a wide range of planning strategies and feedback mechanisms, enabling a systematic study of LLMs on multi-step, multi-modal tasks. The dataset and evaluation code are available on Hugging Face and GitHub.
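To make the setup concrete, here is a minimal sketch of what a JSON-format multi-step plan and two of the feedback types (parsing and verification) might look like. The tool names, argument schema, and checks below are illustrative assumptions, not the exact m&m's specification.

```python
import json

# Hypothetical two-step plan in JSON format; tool names and the
# "<node-1>.output" reference convention are assumptions for illustration.
plan_text = """
[
  {"id": 1, "name": "image classification",
   "args": {"image": "photo.jpg"}},
  {"id": 2, "name": "text generation",
   "args": {"text": "Write a caption about <node-1>.output"}}
]
"""

# A stand-in tool registry; m&m's defines 33 real tools.
KNOWN_TOOLS = {"image classification", "text generation", "object detection"}

def parsing_feedback(text):
    """Parsing feedback: report malformed JSON instead of crashing."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"parsing error: {exc}"

def verification_feedback(plan):
    """Verification feedback: flag tool names absent from the registry."""
    return [f"unknown tool: {step['name']}"
            for step in plan
            if step["name"] not in KNOWN_TOOLS]

plan, error = parsing_feedback(plan_text)
print(error)                        # None: the JSON parsed cleanly
print(verification_feedback(plan))  # []: all tool names are registered
```

In this scheme, any error strings produced by the checks would be fed back to the LLM so it can revise its plan; execution feedback (actually running each tool and reporting failures) would be a third, costlier layer on top.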