Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

24 May 2024 | Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei
This paper introduces Visualization-of-Thought (VoT) prompting, a method for enhancing spatial reasoning in large language models (LLMs). Inspired by the human "mind's eye", the ability to form mental images during spatial reasoning, VoT prompts an LLM to visualize its intermediate reasoning states: after each reasoning step, the model generates a visualization of the current state, which grounds and guides the subsequent steps.

The study evaluates VoT on three tasks that require understanding spatial relationships, directions, and geometric shapes: natural language navigation, visual navigation, and visual tiling. Across these tasks, VoT significantly improves LLMs' spatial reasoning and outperforms existing multimodal large language models (MLLMs). The authors also compare VoT across different models, including GPT-4 and GPT-4V, and find consistent gains, although performance varies with task difficulty and model capability, so the method is not a complete solution.

The results highlight the importance of visual state tracking for spatial reasoning and suggest that LLMs can generate and use mental images in a way that resembles the human mind's eye. The authors argue that future research should focus on strengthening this ability, and that VoT has potential for application in MLLMs, since it enables models to reason about spatial information in a grounded context.
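The core mechanism is a simple prompt addition: the model is asked to draw the current state (for example, as a text grid) after every reasoning step. Below is a minimal sketch of how such a prompt could be issued, assuming the OpenAI Python client; the appended instruction follows the paper's described VoT prompt, but the navigation task and grid layout are hypothetical examples, not taken from the paper's benchmarks.

```python
# Minimal sketch of Visualization-of-Thought (VoT) prompting, assuming the
# OpenAI Python client (pip install openai). The navigation task below is a
# hypothetical example for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK = (
    "You are navigating a 3x3 grid from S (top-left) to G (bottom-right).\n"
    "The center cell is an obstacle. Legal moves: up, down, left, right.\n"
    "Find a valid sequence of moves."
)

# The VoT instruction: interleave a visualization (e.g., an ASCII grid)
# with each reasoning step, so later steps are grounded in the drawn state.
VOT_SUFFIX = "Visualize the state after each reasoning step."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{TASK}\n{VOT_SUFFIX}"}],
)
print(response.choices[0].message.content)
```

With this suffix, the model typically emits a small text grid after each move rather than reasoning about positions purely in prose, which is the visual state tracking the paper identifies as the source of the gains.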