2024 | Andi Peng, Ilia Sucholutsky*, Belinda Z. Li*, Theodore R. Sumers, Thomas L. Griffiths, Jacob Andreas, Julie A. Shah
This paper introduces Language-Guided Abstraction (LGA), a method for using natural language to design state abstractions for imitation learning. LGA leverages pre-trained language models (LMs) to automatically generate state representations tailored to unseen tasks. The process begins with a user providing a natural language description of a task, which is then translated into a state abstraction function by an LM. This function identifies task-relevant features and masks out irrelevant ones. An imitation policy is then trained using a small number of demonstrations and the generated abstract states.
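For concreteness, here is a minimal Python sketch of that translation step, assuming a generic text-in/text-out LM interface and a dictionary-of-features state representation; `query_lm`, the prompt format, and the object names are illustrative stand-ins, not the paper's actual implementation:

```python
# Minimal sketch: turn a task description into a state abstraction
# function via an LM. All names here are hypothetical placeholders.
from typing import Callable, Dict, List


def build_state_abstraction(task_description: str,
                            feature_names: List[str],
                            query_lm: Callable[[str], str]) -> Callable[[Dict], Dict]:
    """Ask an LM which features matter for the task; return a masking function."""
    prompt = (
        f"Task: {task_description}\n"
        f"Scene features: {', '.join(feature_names)}\n"
        "List only the features relevant to completing the task, comma-separated."
    )
    relevant = {name.strip() for name in query_lm(prompt).split(",")}

    def abstract(state: Dict) -> Dict:
        # Keep task-relevant features; mask out everything else.
        return {k: v for k, v in state.items() if k in relevant}

    return abstract


# Hypothetical usage with a stubbed LM response standing in for a real model:
abstract = build_state_abstraction(
    "put the mug in the sink",
    ["mug", "sink", "sponge", "window"],
    query_lm=lambda prompt: "mug, sink",
)
print(abstract({"mug": (0.1, 0.4), "sink": (0.7, 0.2), "sponge": (0.5, 0.5)}))
# -> {'mug': (0.1, 0.4), 'sink': (0.7, 0.2)}
```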
Experiments on simulated robotic tasks show that LGA produces state abstractions similar to those designed by humans but in a fraction of the time. These abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. The method is illustrated on mobile manipulation tasks with a Spot robot.
LGA addresses the challenge of learning generalizable policies in high-dimensional observation spaces by using natural language supervision and the background knowledge encoded in LMs to automatically build state representations. Unlike traditional methods that require manual feature specification or extensive labeling, LGA needs only natural language annotations of state features. It complements traditional supervised learning methods such as behavior cloning (BC) without relying on additional assumptions about the data-labeling process.
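Because the downstream learner is standard behavior cloning, a short sketch of that training step may help; it assumes demonstrations have already been masked by the LGA abstraction and flattened into vectors, and the dimensions and architecture are illustrative assumptions, not the paper's:

```python
# Minimal behavior-cloning sketch over abstracted states.
# Dimensions, architecture, and the random "demonstrations" are placeholders.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # e.g., flattened abstract state, end-effector action
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in demonstration batch of (abstract_state, expert_action) pairs.
states = torch.randn(256, obs_dim)
actions = torch.randn(256, act_dim)

for _ in range(100):
    # Standard supervised BC: regress expert actions from (abstracted) states.
    loss = nn.functional.mse_loss(policy(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```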
LGA outperforms BC and stronger variants of BC in terms of sample efficiency and distributional robustness in both single- and multi-task settings. It matches the performance of human-designed state abstractions while requiring a fraction of the human effort. LGA is particularly effective in handling observational covariate shift and ambiguous linguistic utterances.
The method involves three main steps: textualization, feature abstraction, and instantiation. Textualization converts raw perceptual inputs into a text-based feature set. Feature abstraction uses an LM to select the features relevant to the task. Instantiation maps the abstracted feature set back to an abstracted perceptual input that retains only the task-relevant features.
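A sketch of these three steps on a segmentation-style observation is below; the per-pixel object IDs, the `id_to_name` lookup, and the stubbed LM call are assumptions made for illustration, not the paper's exact interfaces:

```python
# Sketch of LGA's three steps on a tiny segmentation map.
# Names, shapes, and the LM stub are illustrative placeholders.
import numpy as np
from typing import Callable, Dict, List


def textualize(seg: np.ndarray, id_to_name: Dict[int, str]) -> List[str]:
    """Step 1: raw perceptual input -> text-based feature set (object names)."""
    return [id_to_name[i] for i in np.unique(seg) if i in id_to_name]


def feature_abstraction(task: str, features: List[str],
                        query_lm: Callable[[str], str]) -> List[str]:
    """Step 2: an LM selects the task-relevant subset of features."""
    prompt = f"Task: {task}\nObjects: {', '.join(features)}\nRelevant objects:"
    keep = {s.strip() for s in query_lm(prompt).split(",")}
    return [f for f in features if f in keep]


def instantiate(seg: np.ndarray, id_to_name: Dict[int, str],
                relevant: List[str]) -> np.ndarray:
    """Step 3: abstracted feature set -> perceptual input with irrelevant
    objects masked to background (0)."""
    keep_ids = {i for i, n in id_to_name.items() if n in relevant}
    return np.where(np.isin(seg, list(keep_ids)), seg, 0)


# Hypothetical end-to-end usage with a stubbed LM:
seg = np.array([[1, 1, 0], [2, 3, 0]])           # tiny segmentation map
id_to_name = {1: "bowl", 2: "block", 3: "cup"}
features = textualize(seg, id_to_name)            # ['bowl', 'block', 'cup']
relevant = feature_abstraction("put the block in the bowl", features,
                               query_lm=lambda p: "bowl, block")
print(instantiate(seg, id_to_name, relevant))     # cup (id 3) masked out
```

Masking irrelevant objects out of the observation before policy training is what the paper credits for robustness: spurious visual features never reach the learner, so they cannot become confounders.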
LGA is evaluated on three task settings in the VIMA environment: pick-and-place, rotation, and sweeping while avoiding obstacles. LGA outperforms baselines both in task performance and in the user time required to specify task-relevant features, and policies trained with LGA state abstractions are more robust to observational shift than GCBC+DART.
LGA is also tested on real-world robotics tasks with a Spot robot, where it successfully completes both tasks, handling complex goals and ambiguous specifications. The results demonstrate that LGA can flexibly construct state abstractions even when the language utterance was previously unseen, and that it remains effective in multi-task settings where policies must adapt to new language specifications.