MMInA is a benchmark for evaluating embodied agents on multihop, multimodal web tasks. It features evolving real-world websites, naturally compositional tasks that require gathering information from multiple sites, and a novel holistic evaluation protocol that assesses both task-level and hop-level success rates. The benchmark comprises 1,050 human-written tasks across 14 websites, with an average of 2.85 hops and 12.9 actions per task. Unlike existing benchmarks, which do not evaluate agents in realistic, evolving environments, MMInA assesses agents' ability to navigate across sites and complete complex tasks, exposing the challenges of long-chain reasoning.

Experiments show that while humans achieve high success rates, state-of-the-art agents struggle with multihop tasks and fail disproportionately often in early hops; across both single-hop and multihop settings, agents perform better on tasks with fewer hops. Intuitively, if each hop succeeds independently with probability p, a task of n hops succeeds with probability roughly p^n, so performance decays quickly as hop count grows. To mitigate this, MMInA introduces a memory-augmented approach in which agents replay their past action trajectories, and memory-augmented agents show measurably improved performance over their memoryless counterparts.

These results highlight the core challenges of multihop web tasks, including search-space complexity and limited memory, and underscore the need for stronger planning and execution in web agents. Future work includes expanding to mobile platforms and introducing long-term memory mechanisms. Overall, MMInA provides a flexible, realistic testing ground for agent research and identifies key challenges in web navigation and decision-making.
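The holistic evaluation can be illustrated with a minimal sketch, assuming per-hop outcomes are recorded for each task; the data structure and function names below are illustrative assumptions, not MMInA's actual API. A task counts as successful only if every one of its hops succeeds, which is why early-hop failures are so costly.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-hop outcomes for one multihop task (illustrative, not MMInA's API)."""
    hop_outcomes: list[bool]  # True if the agent completed that hop's subtask


def hop_success_rate(results: list[TaskResult]) -> float:
    """Fraction of individual hops completed, pooled over all tasks."""
    hops = [ok for r in results for ok in r.hop_outcomes]
    return sum(hops) / len(hops) if hops else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks in which *every* hop succeeded."""
    if not results:
        return 0.0
    return sum(all(r.hop_outcomes) for r in results) / len(results)


# A 3-hop task that fails at hop 1 scores 0 on task success even if the
# remaining hops are judged completable on their own.
results = [
    TaskResult([False, True, True]),   # early-hop failure sinks the task
    TaskResult([True, True]),          # full success
]
print(f"hop SR:  {hop_success_rate(results):.2f}")   # 0.80
print(f"task SR: {task_success_rate(results):.2f}")  # 0.50
```

The gap between the two numbers shows why hop-level scoring is reported alongside task-level scoring: an agent can complete most hops yet still finish few tasks end to end.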
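The memory-augmented approach of replaying past action trajectories can be sketched along the following lines for a prompt-based agent; the class, method names, and action format are hypothetical assumptions rather than MMInA's actual implementation.

```python
class TrajectoryMemory:
    """Minimal sketch of replaying past action trajectories (hypothetical
    API, not MMInA's actual implementation)."""

    def __init__(self, max_entries: int = 5):
        self.trajectories: list[list[str]] = []  # e.g. ["click [search]", ...]
        self.max_entries = max_entries

    def record(self, actions: list[str]) -> None:
        """Store the action sequence from a completed hop or task."""
        self.trajectories.append(actions)
        # Bounded memory: keep only the most recent trajectories.
        self.trajectories = self.trajectories[-self.max_entries:]

    def as_prompt_context(self) -> str:
        """Serialize stored trajectories so they can be prepended to the
        agent's next prompt, letting it condition on what it already did."""
        lines = []
        for i, traj in enumerate(self.trajectories, start=1):
            lines.append(f"Past trajectory {i}: " + " -> ".join(traj))
        return "\n".join(lines)


memory = TrajectoryMemory()
memory.record(["goto [wikipedia.org]", "type [search] [Eiffel Tower]", "click [go]"])
prompt = memory.as_prompt_context() + "\nCurrent task: book a hotel near the landmark."
```

The bounded buffer reflects the limited-memory challenge noted above: context windows cap how many past trajectories can be replayed, which is one motivation for the long-term memory mechanisms flagged as future work.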