MMInA is a benchmark designed to evaluate the capabilities of embodied agents in navigating and completing complex tasks across multiple multimodal websites. The benchmark features 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to extract multimodal information from web pages. Key properties of MMInA include:
1. **Evolving Real-World Multimodal Websites**: The benchmark operates on real-world websites that evolve over time, ensuring high realism and applicability to natural user tasks.
2. **Multihop Web Browsing**: Tasks require gathering information from multiple websites, testing agents' long-range reasoning capabilities.
3. **Holistic Evaluation**: A novel protocol evaluates agents' progress within multihop tasks, scoring both per-hop completion and end-to-end task success (see the sketch after this list).
Experiments with state-of-the-art agents, including large language models (LLMs) and large multimodal models (LMMs), show that while LLMs perform well on single-hop tasks, they struggle with multihop tasks, achieving a success rate of only 21.8% across all tasks. The main difficulty is long-chain reasoning: agents tend to fail on early hops, and since every hop must succeed for a task to count, these early failures cap the overall task success rate.
To address these issues, the authors propose a memory-augmented method that replays past action trajectories to improve agents' performance (sketched below). This method significantly enhances both single-hop and multihop web browsing abilities, demonstrating the effectiveness of memory augmentation in improving agent performance.
The benchmark and evaluation methods provide a comprehensive framework for assessing the capabilities of multihop and multimodal Internet agents, highlighting the need for further advancements in long-range reasoning and multimodal understanding.