7 Jun 2024 | Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
This survey explores methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. It focuses on methods that can scale to large internet video datasets and extract foundational knowledge about the world's dynamics and human behavior, knowledge that holds promise for developing general-purpose robots.

The survey begins with an overview of fundamental concepts relevant to LfV for robotics, including its benefits and challenges. It then reviews video foundation model techniques for extracting knowledge from large, heterogeneous video datasets, followed by methods that leverage video data for robot learning. It also highlights techniques for mitigating LfV challenges, such as action representations that compensate for the missing action labels in video, and examines LfV datasets and benchmarks before concluding with a discussion of challenges and opportunities in LfV.

Throughout, the survey advocates for scalable foundation model approaches that can leverage internet video data to learn the key knowledge modalities of RL, such as policies and dynamics models. It aims to provide a comprehensive reference for the emerging field of LfV and to catalyze further research and progress towards general-purpose robots. The structure covers background, LfV for robotics, video foundation models, LfV methods, datasets, benchmarks, and challenges and opportunities, followed by a conclusion. Key contributions include advocating for LfV, formalizing its fundamental concepts, enumerating and taxonomizing the literature, conducting critical analysis, and identifying key challenges and opportunities. Potential benefits of LfV discussed include improved generalization, data efficiency, and emergent capabilities; challenges include missing action labels, distribution shifts, and missing low-level information. LfV methods are evaluated on scalability, downstream performance gains, and other criteria.
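One common mitigation for the missing-action-labels challenge mentioned above is to train an inverse dynamics model (IDM) on a small action-labeled robot dataset and then use it to pseudo-label large action-free video datasets. The sketch below illustrates the idea on a toy linear system; the synthetic dynamics, dimensions, and the least-squares IDM are illustrative assumptions, not details taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (unknown to the learner) dynamics: s' = s + B @ a,
# with 4-D observations and 2-D actions. Illustrative assumption only.
B = rng.normal(size=(4, 2))

def rollout(n):
    """Generate n transitions (s, a, s') from the toy dynamics."""
    s = rng.normal(size=(n, 4))
    a = rng.normal(size=(n, 2))
    s_next = s + a @ B.T
    return s, a, s_next

# 1) Fit an IDM on a small *action-labeled* dataset:
#    predict the action from consecutive observations (s, s').
s, a, s_next = rollout(1000)
X = np.hstack([s, s_next])                  # IDM input: (s, s') pairs
W, *_ = np.linalg.lstsq(X, a, rcond=None)   # linear IDM via least squares

# 2) Pseudo-label a larger *action-free* video dataset with the IDM.
vs, va_true, vs_next = rollout(200)         # pretend actions are unobserved
pseudo_actions = np.hstack([vs, vs_next]) @ W

err = np.abs(pseudo_actions - va_true).mean()
print(f"mean pseudo-label error: {err:.4f}")
```

Once pseudo-labeled, the previously action-free video transitions can be treated as ordinary demonstration data, for example for behavior cloning; in practice the IDM is a deep network over video frames rather than a linear map.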
The survey highlights the importance of video foundation models in LfV and their potential applications in robotics.