MMInA is a benchmark designed to evaluate the capabilities of embodied agents in navigating and completing complex tasks across multiple multimodal websites. The benchmark features 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to extract multimodal information from web pages. Key properties of MMInA include:
1. **Evolving Real-World Multimodal Websites**: The benchmark operates on real-world websites that evolve over time, ensuring high realism and applicability to natural user tasks.
2. **Multihop Web Browsing**: Tasks require gathering information from multiple websites, testing agents' long-range reasoning capabilities.
3. **Holistic Evaluation**: A novel protocol evaluates agents' progress within multihop tasks, scoring both per-hop completion and end-to-end task success (see the sketch after this list).
Experiments with state-of-the-art agents, including large language models (LLMs) and large multimodal models (LMMs), show that while LLMs perform well on single-hop tasks, they struggle with multihop tasks, achieving a success rate of only 21.8% across all tasks. The main difficulty is long-chain reasoning: agents tend to fail on early hops, and since every hop must succeed for a task to count, these early failures cap the overall task success rate.
To address these issues, the authors propose a memory-augmented method that replays past action trajectories to improve agents' performance (sketched below). This method significantly enhances both single-hop and multihop web browsing abilities, demonstrating the effectiveness of memory augmentation in improving agent performance.
The benchmark and evaluation methods provide a comprehensive framework for assessing the capabilities of multihop and multimodal Internet agents, highlighting the need for further advancements in long-range reasoning and multimodal understanding.