A Survey of Robotic Language Grounding: Tradeoffs between Symbols and Embeddings

22 Jun 2024 | Vanya Cohen, Jason Xinyu Liu, Raymond Mooney, Stefanie Tellex, David Watkins
This survey explores the tradeoffs between symbolic representations and high-dimensional embeddings in robotic language grounding. It reviews recent work that maps natural language to robot behavior, situating approaches along a spectrum from formal symbolic representations to high-dimensional vector spaces. Symbolic representations permit precise meaning representation, limit learning complexity, and enable interpretability and safety guarantees, but they constrain model flexibility and expressive power. High-dimensional embeddings forgo symbolic structure, enabling broader generalization given more data but requiring greater training resources. The survey discusses the benefits and tradeoffs of each approach and suggests future research directions that combine the strengths of both.

On the symbolic end, the survey evaluates methods that map natural language to formal representations such as temporal logic, the Planning Domain Definition Language (PDDL), and code; these methods often rely on symbolic planners to generate robot actions. Recent works leverage large language models (LLMs) to translate natural language into formal representations, improving performance on tasks like navigation and manipulation. For example, Lang2LTL uses LLMs to ground navigation commands to linear temporal logic (LTL) formulas, while AutoTAMP uses LLMs to translate task descriptions into signal temporal logic (STL) formulas and generate trajectories.

On the other end of the spectrum, methods map natural language to high-dimensional embeddings, typically via end-to-end neural networks. These approaches require large datasets and substantial computational resources, but they can generalize more broadly and represent user intent more flexibly. Examples include VIMA, which maps vision-and-language instructions to low-level robot actions, and VPT, which uses video pretraining to learn robot actions. The survey also discusses the challenges of end-to-end approaches, including data collection, generalization, and safety.
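To make the symbolic end of the spectrum concrete, the sketch below builds the kind of LTL formula a system like Lang2LTL targets for a "visit these landmarks in order, avoiding these" navigation command. This is a hand-written illustration of the target representation only — the landmark names and helper functions are hypothetical, and real systems use an LLM to produce such formulas from free-form language rather than a fixed template.

```python
# Illustrative sketch (not the Lang2LTL implementation): construct an LTL
# formula string for an ordered-visit navigation command with avoid regions.

def visits_in_order(landmarks):
    """Nest 'eventually' (F) operators: visit a, then b, ... -> F(a & F(b & ...))."""
    formula = landmarks[-1]
    for lm in reversed(landmarks[:-1]):
        formula = f"{lm} & F({formula})"
    return f"F({formula})"

def ground_command(sequence, avoid=()):
    """Conjoin the ordered-visit formula with 'globally' (G) avoidance constraints."""
    parts = [visits_in_order(sequence)]
    parts += [f"G(!{lm})" for lm in avoid]  # never enter an avoided region
    return " & ".join(parts)

# "Go to the kitchen, then the office, and never enter the hallway."
print(ground_command(["kitchen", "office"], avoid=["hallway"]))
# F(kitchen & F(office)) & G(!hallway)
```

A formula in this form can then be handed to an off-the-shelf symbolic planner or LTL model checker, which is what gives these pipelines their interpretability and safety guarantees.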
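The embedding end of the spectrum can be caricatured the same way. In the toy sketch below (not VIMA or VPT themselves), instructions and candidate behaviors live in a shared vector space and the robot executes the behavior whose embedding is nearest to the instruction's; the vectors and behavior names here are invented for illustration, whereas a real system learns the embeddings end to end from large datasets.

```python
# Toy sketch of embedding-based grounding: pick the behavior whose learned
# embedding is most similar (by cosine) to the instruction embedding.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d embeddings; real systems learn high-dimensional ones.
behaviors = {
    "pick_up_mug": [0.9, 0.1, 0.0],
    "open_drawer": [0.1, 0.8, 0.2],
}
instruction_embedding = [0.85, 0.15, 0.05]  # e.g. output of a text encoder

best = max(behaviors, key=lambda b: cosine(instruction_embedding, behaviors[b]))
print(best)  # pick_up_mug
```

Nothing in this pipeline is symbolic, which is precisely why such systems can flexibly absorb novel phrasings given enough data, and also why their decisions resist the formal guarantees available on the symbolic side.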
It highlights the importance of interpretability and safety in robotic systems, noting that formal methods provide stronger guarantees but may lack flexibility. The survey concludes that future research should aim to combine the strengths of both symbolic and end-to-end approaches to achieve more robust and interpretable robotic systems.