[slides] MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

MineDreamer is an embodied agent designed to follow human-like instructions in a simulated open-world environment, specifically Minecraft. It uses a novel Chain-of-Imagination (CoI) mechanism that generates precise visual prompts from the current state and the instruction, and these prompts guide the generation of low-level control actions. The agent is built on Multimodal Large Language Models (MLLMs) and diffusion models and has three main components: an Imaginator that generates imaginations, a Prompt Generator that turns them into visual prompts, and a PolicyNet that generates actions. The CoI mechanism lets the agent break an instruction down into multiple stages, so it follows instructions more steadily. To train the agent, the method gathers data with a Goal Drift Collection method, which addresses challenges in instruction following. Extensive experiments show that MineDreamer outperforms existing agents on both single-step and multi-step instructions, achieving nearly double the performance of the best baseline. Because its visual prompts are tailored to the current state and the instruction, the agent also generalizes well in open-world environments.
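The three-component loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MineDreamer's actual implementation: every class, function, and parameter name here is hypothetical, the "world" is a one-dimensional toy in place of Minecraft, and the trained models are replaced by hand-written stubs. It only shows the control flow of imagining, converting the imagination to a prompt, and acting on that prompt at each stage.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical placeholders, not MineDreamer's API.
Observation = List[float]   # stand-in for an image frame
Imagination = List[float]   # stand-in for an imagined future frame
VisualPrompt = List[float]  # stand-in for a visual prompt
Action = str                # stand-in for a low-level control action


@dataclass
class ChainOfImaginationAgent:
    imaginator: Callable[[Observation, str], Imagination]
    prompt_generator: Callable[[Observation, Imagination], VisualPrompt]
    policy_net: Callable[[Observation, VisualPrompt], Action]

    def run(self, env_step, first_obs: Observation, instruction: str,
            max_steps: int) -> List[Action]:
        """Imagine, build a prompt, and act, once per stage."""
        obs = first_obs
        actions: List[Action] = []
        for _ in range(max_steps):
            # 1. Imagine the next stage from the current state and instruction.
            imagination = self.imaginator(obs, instruction)
            # 2. Turn the imagination into a prompt tailored to the current state.
            prompt = self.prompt_generator(obs, imagination)
            # 3. Produce a low-level action conditioned on that prompt.
            action = self.policy_net(obs, prompt)
            actions.append(action)
            obs = env_step(obs, action)
        return actions


# Toy stubs (assumptions): the state is a 1-D position and the
# instruction has the form "reach <number>".
def toy_imaginator(obs: Observation, instruction: str) -> Imagination:
    target = float(instruction.split()[-1])
    step = 1.0 if target > obs[0] else (-1.0 if target < obs[0] else 0.0)
    return [obs[0] + step]  # imagined position one stage ahead


def toy_prompt_generator(obs: Observation, imagination: Imagination) -> VisualPrompt:
    return [imagination[0] - obs[0]]  # direction toward the imagined state


def toy_policy(obs: Observation, prompt: VisualPrompt) -> Action:
    if prompt[0] > 0:
        return "forward"
    if prompt[0] < 0:
        return "back"
    return "noop"


def toy_env_step(obs: Observation, action: Action) -> Observation:
    delta = {"forward": 1.0, "back": -1.0, "noop": 0.0}[action]
    return [obs[0] + delta]


agent = ChainOfImaginationAgent(toy_imaginator, toy_prompt_generator, toy_policy)
trajectory = agent.run(toy_env_step, [0.0], "reach 3", max_steps=5)
```

In this toy setting the agent takes three "forward" actions and then idles, because each new imagination is computed from the latest state rather than fixed once at the start. Re-imagining at every step is an assumption made for simplicity; the summary does not specify how often imaginations are refreshed.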

MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

19 Mar 2024 | Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao