MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

11 Apr 2025 | Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, Shenghua Gao
MLLM-Tool is a multimodal large language model for tool agent learning that integrates open-source LLMs with multimodal encoders, enabling the tuned LLM to perceive multimodal input instructions and select the function-matched tool. It handles four input types: text, text+image, text+video, and text+audio. ImageBind serves as the multimodal encoder, and the open-source LLM is fine-tuned with Low-Rank Adaptation (LoRA).

To evaluate the model, a benchmark named ToolMMBench is collected from HuggingFace. It features tools with multimodal inputs and multiple valid tool choices for the same instruction, covering 932 high-quality APIs plus an "Unknown" category. Evaluation metrics are tool-selection accuracy, hallucination rate, and format accuracy.

Experiments show that MLLM-Tool recommends appropriate tools for multimodal instructions, achieving 88.19% tool-selection accuracy. Evaluated under varying conditions, including ambiguity types, multiple valid options, and individual modalities, and compared against other models, it handles ambiguous queries and multimodal inputs effectively. The dataset and code are publicly available.
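The following is a minimal sketch (not the authors' code) of the architecture described above: a frozen multimodal encoder such as ImageBind produces an embedding, a small projector maps it into the LLM's token-embedding space, and the LLM itself is adapted with LoRA. The base model name, projector design, and LoRA target modules are illustrative assumptions.

```python
# Sketch of the ImageBind-embedding + LoRA-tuned LLM setup, under stated assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

llm_name = "meta-llama/Llama-2-7b-hf"  # assumed open-source base LLM
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

# Wrap the LLM with LoRA adapters so only low-rank matrices are trained.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)

# Linear projector from the encoder's embedding size (1024 for imagebind_huge)
# into the LLM hidden size; this bridging layer is trained jointly.
projector = nn.Linear(1024, llm.config.hidden_size)

def build_inputs(instruction: str, modal_embedding: torch.Tensor) -> torch.Tensor:
    """Prepend the projected multimodal embedding to the text-token embeddings.

    modal_embedding: (1, 1024) tensor from the frozen multimodal encoder.
    """
    text_ids = tokenizer(instruction, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)      # (1, T, H)
    modal_embeds = projector(modal_embedding).unsqueeze(1)  # (1, 1, H)
    return torch.cat([modal_embeds, text_embeds], dim=1)
```

A pure-text query would skip the projected prefix, which is how a single model can cover all four input types.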
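The three reported metrics can be sketched as below. The exact definitions in the paper may differ; here "hallucination" means predicting a tool name outside the 932-API list, and "format accuracy" means the output parses into the expected answer pattern. The answer format and `parse_tool_name` helper are hypothetical.

```python
# Hedged sketch of tool-selection accuracy, hallucination rate, and format accuracy.
import re

def parse_tool_name(output: str) -> str | None:
    """Extract the predicted API name; returns None if the output is malformed."""
    m = re.search(r"API:\s*(\S+)", output)  # assumed answer format
    return m.group(1) if m else None

def evaluate(outputs: list[str], gold_sets: list[set[str]], api_list: list[str]) -> dict:
    """outputs: raw model strings; gold_sets: acceptable APIs per query
    (a set, since one instruction can have multiple valid tool choices)."""
    apis = set(api_list)  # the 932 APIs plus "Unknown"
    n = len(outputs)
    correct = hallucinated = well_formed = 0
    for out, gold in zip(outputs, gold_sets):
        pred = parse_tool_name(out)
        if pred is not None:
            well_formed += 1
            if pred not in apis:
                hallucinated += 1  # invented a tool that does not exist
            elif pred in gold:
                correct += 1
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "format_accuracy": well_formed / n,
    }
```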