MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

19 Mar 2024 | Enshen Zhou1,2*, Yiran Qin1,3*, Zhenfei Yin1,4, Yuzhou Huang3†, Ruimao Zhang3†, Lu Sheng2†, Yu Qiao1, Jing Shao1†

**Abstract:** The paper introduces *MineDreamer*, an open-ended embodied agent designed to follow diverse instructions in a human-like manner. *MineDreamer* is built on the Minecraft simulator and employs a Chain-of-Imagination (CoI) mechanism to enhance its instruction-following ability. The CoI mechanism breaks down instructions into multiple stages, allowing the agent to generate precise visual prompts tailored to the current state. The agent then uses these prompts to generate low-level control actions, steadily following the instructions. Extensive experiments demonstrate that *MineDreamer* outperforms existing generalist agents, achieving nearly double the performance in executing single and multi-step instructions. Qualitative analysis reveals the agent's ability to generalize and comprehend complex environments.

**Keywords:** Chain-of-Imagination · multimodal large language model · instruction following · low-level control

**Introduction:** The paper addresses the challenge of designing a generalist embodied agent that can follow diverse instructions in an open-world environment. Existing approaches often struggle with understanding abstract and sequential natural language instructions, leading to inconsistent performance. *MineDreamer* leverages recent advances in Multimodal Large Language Models (MLLMs) and diffusion models to enhance its instruction-following ability. The CoI mechanism enables the agent to imagine and act upon the next stage of a task, ensuring steady action generation. The agent consists of three modules: an Imaginator, a Prompt Generator, and a PolicyNet. The Imaginator generates imaginations that adhere to physical rules and environmental understanding, while the Prompt Generator converts these imaginations into precise visual prompts. The PolicyNet uses these prompts to predict actions, guided by the current state and instruction.
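
The three-module loop described in the Introduction can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `CoIAgent`, the `reimagine_every` interval, and the toy one-dimensional world (integer positions standing in for image frames, a float standing in for a visual prompt) are all assumptions made for illustration. The sketch only shows the control flow: imagine the next stage from the current state, turn that imagination into a prompt, and let the policy act on the prompt together with the current state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CoIAgent:
    """Hypothetical wrapper wiring together the three modules named in the summary."""
    imaginator: Callable[[str, int], int]       # (instruction, current state) -> imagined next stage
    prompt_generator: Callable[[int], float]    # imagination -> visual prompt
    policy_net: Callable[[float, int], str]     # (prompt, current state) -> low-level action
    reimagine_every: int = 2                    # assumed refresh interval, not from the source

    def run(self, instruction: str, obs: int,
            env_step: Callable[[int, str], int], max_steps: int) -> List[str]:
        actions: List[str] = []
        prompt: Optional[float] = None
        for t in range(max_steps):
            # Refresh the imagination from the *current* state so the prompt
            # always describes the next stage of the task, not the first one.
            if t % self.reimagine_every == 0:
                imagination = self.imaginator(instruction, obs)
                prompt = self.prompt_generator(imagination)
            action = self.policy_net(prompt, obs)
            obs = env_step(obs, action)
            actions.append(action)
        return actions


# --- Toy stand-ins (assumptions): positions on a line replace image frames ---
def toy_imaginator(instruction: str, obs: int) -> int:
    goal = int(instruction.split()[-1])   # e.g. "walk to 3" -> 3
    return min(obs + 2, goal)             # imagine a stage at most two steps ahead


def toy_prompt_generator(imagination: int) -> float:
    return float(imagination)             # trivial encoding of the imagined stage


def toy_policy(prompt: float, obs: int) -> str:
    return "forward" if prompt > obs else "noop"


def toy_env_step(obs: int, action: str) -> int:
    return obs + 1 if action == "forward" else obs


agent = CoIAgent(toy_imaginator, toy_prompt_generator, toy_policy)
actions = agent.run("walk to 3", obs=0, env_step=toy_env_step, max_steps=5)
print(actions)
```

In this toy run the agent re-imagines every two steps, so each prompt targets an intermediate stage rather than the final goal, which is the idea the summary attributes to the CoI mechanism. In the system the summary describes, the stand-ins would be learned models rather than simple functions.
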

**Related Work:** The paper discusses existing work on embodied agents in Minecraft and on conditioned diffusion models. It highlights the importance of understanding and following instructions in complex environments, and explains how *MineDreamer* addresses these challenges through its CoI mechanism.

**Experiments:** The paper evaluates *MineDreamer* using various datasets and baselines, demonstrating its superior performance in both programmatic and command-switching evaluations. The agent consistently outperforms unconditional models and models that ignore the current state, showing that it can follow instructions steadily and adapt to new tasks.

**Conclusion:** *MineDreamer* demonstrates a novel paradigm for enhancing instruction-following abilities in simulated-world control. The agent's strong performance in Minecraft highlights its potential as a downstream controller for a high-level planner. The paper concludes by discussing limitations and future directions, emphasizing the importance of improving speed and reducing hallucination.