Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

24 May 2024 | Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei
This paper introduces Visualization-of-Thought (VoT) prompting, a method for enhancing spatial reasoning in large language models (LLMs). Inspired by the human "mind's eye", the ability to form mental images during spatial reasoning, VoT prompts an LLM to visualize its intermediate reasoning states: after each reasoning step, the model generates a visualization of the current state, which grounds and guides the subsequent steps.

The study evaluates VoT on three tasks that require understanding spatial relationships, directions, and geometric shapes: natural language navigation, visual navigation, and visual tiling. Across these tasks, VoT significantly improves LLMs' spatial reasoning and outperforms existing multimodal large language models (MLLMs). The authors also compare VoT across different models, including GPT-4 and GPT-4V, and find consistent gains, although performance varies with task difficulty and model capability, so the method is not a complete solution.

The results highlight the importance of visual state tracking for spatial reasoning and suggest that LLMs can generate and use mental images in a way that resembles the human mind's eye. The authors argue that future research should focus on strengthening this ability, and that VoT has potential for application in MLLMs, since it enables models to reason about spatial information in a grounded context.
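The core mechanism is a simple prompt addition: the model is asked to draw the current state (for example, as a text grid) after every reasoning step. Below is a minimal sketch of how such a prompt could be issued, assuming the OpenAI Python client; the appended instruction follows the paper's described VoT prompt, but the navigation task and grid layout are hypothetical examples, not taken from the paper's benchmarks.

```python
# Minimal sketch of Visualization-of-Thought (VoT) prompting, assuming the
# OpenAI Python client (pip install openai). The navigation task below is a
# hypothetical example for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK = (
    "You are navigating a 3x3 grid from S (top-left) to G (bottom-right).\n"
    "The center cell is an obstacle. Legal moves: up, down, left, right.\n"
    "Find a valid sequence of moves."
)

# The VoT instruction: interleave a visualization (e.g., an ASCII grid)
# with each reasoning step, so later steps are grounded in the drawn state.
VOT_SUFFIX = "Visualize the state after each reasoning step."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{TASK}\n{VOT_SUFFIX}"}],
)
print(response.choices[0].message.content)
```

With this suffix, the model typically emits a small text grid after each move rather than reasoning about positions purely in prose, which is the visual state tracking the paper identifies as the source of the gains.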