MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

11 Apr 2025 | Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, Shenghua Gao
MLLM-Tool is a multimodal large language model for tool agent learning that integrates open-source LLMs with multimodal encoders, enabling the tuned LLM to perceive multimodal input instructions and select the function-matched tool. It handles four input types: text, text+image, text+video, and text+audio. ImageBind serves as the multimodal encoder, and the open-source LLM is fine-tuned with Low-Rank Adaptation (LoRA).

To evaluate the model, a benchmark named ToolMMBench is collected from HuggingFace. It features tools with multimodal inputs and multiple valid tool choices for the same instruction, covering 932 high-quality APIs plus an "Unknown" category. Evaluation metrics are tool-selection accuracy, hallucination rate, and format accuracy.

Experiments show that MLLM-Tool recommends appropriate tools for multimodal instructions, achieving 88.19% tool-selection accuracy. Evaluated under varying conditions, including ambiguity types, multiple valid options, and individual modalities, and compared against other models, it handles ambiguous queries and multimodal inputs effectively. The dataset and code are publicly available.
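The following is a minimal sketch (not the authors' code) of the architecture described above: a frozen multimodal encoder such as ImageBind produces an embedding, a small projector maps it into the LLM's token-embedding space, and the LLM itself is adapted with LoRA. The base model name, projector design, and LoRA target modules are illustrative assumptions.

```python
# Sketch of the ImageBind-embedding + LoRA-tuned LLM setup, under stated assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

llm_name = "meta-llama/Llama-2-7b-hf"  # assumed open-source base LLM
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

# Wrap the LLM with LoRA adapters so only low-rank matrices are trained.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)

# Linear projector from the encoder's embedding size (1024 for imagebind_huge)
# into the LLM hidden size; this bridging layer is trained jointly.
projector = nn.Linear(1024, llm.config.hidden_size)

def build_inputs(instruction: str, modal_embedding: torch.Tensor) -> torch.Tensor:
    """Prepend the projected multimodal embedding to the text-token embeddings.

    modal_embedding: (1, 1024) tensor from the frozen multimodal encoder.
    """
    text_ids = tokenizer(instruction, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)      # (1, T, H)
    modal_embeds = projector(modal_embedding).unsqueeze(1)  # (1, 1, H)
    return torch.cat([modal_embeds, text_embeds], dim=1)
```

A pure-text query would skip the projected prefix, which is how a single model can cover all four input types.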
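The three reported metrics can be sketched as below. The exact definitions in the paper may differ; here "hallucination" means predicting a tool name outside the 932-API list, and "format accuracy" means the output parses into the expected answer pattern. The answer format and `parse_tool_name` helper are hypothetical.

```python
# Hedged sketch of tool-selection accuracy, hallucination rate, and format accuracy.
import re

def parse_tool_name(output: str) -> str | None:
    """Extract the predicted API name; returns None if the output is malformed."""
    m = re.search(r"API:\s*(\S+)", output)  # assumed answer format
    return m.group(1) if m else None

def evaluate(outputs: list[str], gold_sets: list[set[str]], api_list: list[str]) -> dict:
    """outputs: raw model strings; gold_sets: acceptable APIs per query
    (a set, since one instruction can have multiple valid tool choices)."""
    apis = set(api_list)  # the 932 APIs plus "Unknown"
    n = len(outputs)
    correct = hallucinated = well_formed = 0
    for out, gold in zip(outputs, gold_sets):
        pred = parse_tool_name(out)
        if pred is not None:
            well_formed += 1
            if pred not in apis:
                hallucinated += 1  # invented a tool that does not exist
            elif pred in gold:
                correct += 1
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "format_accuracy": well_formed / n,
    }
```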