SPIN: Simultaneous Perception, Interaction and Navigation

13 May 2024 | Shagun Uppal, Ananye Agarwal, Haoyu Xiong, Kenneth Shaw, Deepak Pathak
This paper presents SPIN, an end-to-end approach to Simultaneous Perception, Interaction, and Navigation. The goal is to enable a robot to simultaneously perceive, manipulate, and navigate in cluttered environments. The robot uses an active visual system to consciously perceive and react to its surroundings, and it can navigate complex cluttered scenes while displaying agile whole-body coordination using only ego-vision, without needing to build environment maps.

The main challenge in mobile manipulation is to coordinate the robot's base and arm, rely on onboard perception for perceiving and interacting with the environment, and integrate all of these parts together. Prior works approach the problem with disentangled, modular skills for mobility and manipulation that are trivially tied together. This causes several limitations, such as compounding errors, delays in decision-making, and a lack of whole-body coordination. In this work, we instead present a reactive mobile manipulation framework built around an active visual system.

The robot is trained using reinforcement learning (RL). To get around the computational bottleneck of rendering depth images during RL, we use a teacher-student training framework: robot behavior is first learned with RL using privileged access to visible object scandots, and then distilled into a policy that operates from ego-depth using supervised learning.
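A rough sketch of this teacher-student distillation is shown below. It is a minimal illustration, assuming PyTorch, small placeholder networks, and made-up observation and action dimensions; the paper's actual architectures, observation spaces, and training loop (e.g., any DAgger-style on-policy data collection) are not reproduced here.

```python
# Hedged sketch of teacher-student distillation: an RL-trained teacher with
# privileged scandots is regressed into a depth-only student. All shapes and
# layer sizes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """Privileged policy: proprioception + object scandots (cheap to query in sim)."""
    def __init__(self, prop_dim=32, num_scandots=128, act_dim=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prop_dim + 3 * num_scandots, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, prop, scandots):
        return self.net(torch.cat([prop, scandots.flatten(1)], dim=-1))

class StudentPolicy(nn.Module):
    """Deployable policy: proprioception + ego-centric depth image only."""
    def __init__(self, prop_dim=32, act_dim=19):
        super().__init__()
        self.depth_encoder = nn.Sequential(          # assumes a 1 x 64 x 64 depth image
            nn.Conv2d(1, 16, 5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ELU(),
            nn.Flatten(), nn.LazyLinear(128), nn.ELU(),
        )
        self.head = nn.Sequential(
            nn.Linear(128 + prop_dim, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, prop, depth):
        z = self.depth_encoder(depth)
        return self.head(torch.cat([z, prop], dim=-1))

def distill_step(teacher, student, optimizer, batch):
    """One supervised update: regress the student's action onto the frozen teacher's."""
    with torch.no_grad():
        target = teacher(batch["prop"], batch["scandots"])
    pred = student(batch["prop"], batch["depth"])
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design point is that expensive depth rendering is only paid for during the supervised distillation phase, while RL exploration runs against low-dimensional scandot observations.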
We evaluate across 6 benchmarks in simulation, ranging from easy to medium to hard difficulty, and in two real-world environments with a level of clutter similar to the hard simulated environments, to which we also add dynamic, adversarial obstacles. Our method outperforms classical methods and baselines that do not use active vision. We also observe emergent behaviors, including avoidance of dynamic obstacles that the robot never saw during training.

Our approach rests on a radical hypothesis: the traditionally non-reactive, planning-based approach to whole-body control can be cast as a reactive model, i.e., a single end-to-end policy trained by RL. Despite being a big departure from the optimal control literature, this hypothesis is not as surprising as it seems, since agile whole-body coordination and fast obstacle avoidance in humans are developed into muscle memory over time.

We now discuss our approach in detail. We want our mobile manipulator to navigate and manipulate objects while avoiding obstacles in cluttered environments. The robot shares anatomical similarities with a human, which brings many of the same challenges. First, it has a limb in the form of an arm that can be raised and lowered, so the robot must constantly move the arm to avoid obstacles. Second, it has an actuated camera with a very limited field of view (87 degrees horizontal, 58 degrees vertical), so it needs to constantly look around to plan ahead while watching for unexpected obstacles (see the sketch below). Imagine yourself walking through a cluttered cabinet: there are too many obstacles around to keep track of, and you can't see all of them at once.
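To make the field-of-view constraint concrete, the sketch below (illustrative geometry only, with an assumed camera-frame convention) checks whether a point lies inside the quoted 87° x 58° frustum. Anything outside it stays invisible until the robot actively re-aims its camera, which is why the policy must also decide where to look.

```python
# Illustrative only: test whether an obstacle (expressed in the camera frame)
# falls inside the 87-degree x 58-degree field of view quoted above.
# Frame convention assumed here: x forward, y left, z up.
import numpy as np

H_FOV = np.radians(87.0)  # horizontal field of view
V_FOV = np.radians(58.0)  # vertical field of view

def in_view(point_cam: np.ndarray) -> bool:
    x, y, z = point_cam
    if x <= 0:                      # behind the camera
        return False
    yaw = np.arctan2(y, x)          # horizontal angle off the optical axis
    pitch = np.arctan2(z, x)        # vertical angle off the optical axis
    return abs(yaw) <= H_FOV / 2 and abs(pitch) <= V_FOV / 2

# An obstacle 1 m ahead and 0.8 m to the side (~38.7 deg off-axis) is visible;
# at 1.2 m to the side (~50.2 deg) it is not, and the camera must turn to see it.
print(in_view(np.array([1.0, 0.8, 0.0])))   # True
print(in_view(np.array([1.0, 1.2, 0.0])))   # False
```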