An Interactive Agent Foundation Model

17 Jun 2024 | Zane Durante *12§, Bidipta Sarkar *12§, Ran Gong *23§, Rohan Taori 12§, Yusuke Noda 2, Paul Tang 1, Ehsan Adeli 1, Shrinidhi Kowshika Lakshminathan 1, Kevin Schulman 1, Arnold Milstein 1, Demetri Terzopoulos 3, Ade Famoti 2, Noboru Kuno 2, Ashley Llorens 2, Hoi Vo 2†, Katsu Ikeuchi 2†, Li Fei-Fei 1†, Jianfeng Gao 2‡, Naoki Wake * 2 ►, Qiuyuan Huang * 2 ►
The paper introduces the Interactive Agent Foundation Model (IAFM), designed to handle text, action, and visual inputs, with the aim of building a versatile, adaptable AI framework that performs well across diverse domains. The IAFM uses a novel multi-task agent training paradigm that unifies several pre-training strategies, including visual masked auto-encoding, language modeling, and next-action prediction, enabling the model to generate contextually relevant outputs for robotics, gaming AI, and healthcare applications.

The model is trained on a wide range of data, including robotics sequences, gameplay data, large-scale video datasets, and text, using a joint image and video encoder. It is initialized with pre-trained submodules from CLIP and OPT, which are then jointly trained in a unified framework. Training involves predicting masked tokens across all three modalities, strengthening the model's ability to understand and interact with its environment.

The paper evaluates the IAFM across three distinct domains: robotics, gaming AI, and healthcare. In robotics, the model carries out manipulation tasks effectively from language instructions. In gaming, it predicts actions from video frames and high-level instructions, showing improved performance over fine-tuning from scratch. In healthcare, it performs well on video captioning, visual question answering, and activity-recognition tasks such as RASS score prediction. This generality and adaptability make the IAFM a promising basis for generalist, action-taking, multimodal systems, and the authors plan to release the code and models publicly to support further research.
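To make the multi-task masked-token objective concrete, the following is a minimal PyTorch sketch, not the authors' released code: it embeds visual, text, and action tokens into one shared transformer sequence and applies a masked-prediction loss per modality. The class and parameter names (`UnifiedAgentModel`, the small vocabulary sizes, the 15% masking rate) are illustrative assumptions; the actual IAFM initializes its visual and language submodules from CLIP and OPT rather than training embeddings from scratch.

```python
# Illustrative sketch only (hypothetical names, not the IAFM implementation):
# a unified transformer that embeds visual, text, and action tokens into a
# shared sequence and is trained to recover masked tokens in any modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedAgentModel(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 text_vocab=1000, action_vocab=32, visual_vocab=512):
        super().__init__()
        # Separate token embeddings per modality, projected into one shared space.
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.action_emb = nn.Embedding(action_vocab, d_model)
        self.visual_emb = nn.Embedding(visual_vocab, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, n_layers)
        # One prediction head per modality for masked-token reconstruction.
        self.visual_head = nn.Linear(d_model, visual_vocab)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.action_head = nn.Linear(d_model, action_vocab)

    def forward(self, visual_ids, text_ids, action_ids):
        # Concatenate all modalities into a single token sequence.
        x = torch.cat([
            self.visual_emb(visual_ids),
            self.text_emb(text_ids),
            self.action_emb(action_ids),
        ], dim=1)
        h = self.backbone(x)
        n_vis, n_txt = visual_ids.size(1), text_ids.size(1)
        return (self.visual_head(h[:, :n_vis]),
                self.text_head(h[:, n_vis:n_vis + n_txt]),
                self.action_head(h[:, n_vis + n_txt:]))


def masked_prediction_loss(logits, targets, mask):
    """Cross-entropy computed only on positions that were masked in the input."""
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    model = UnifiedAgentModel()
    B, V, T, A = 2, 16, 8, 4  # batch size; visual, text, action sequence lengths
    visual = torch.randint(0, 512, (B, V))
    text = torch.randint(0, 1000, (B, T))
    action = torch.randint(0, 32, (B, A))
    # Randomly mask ~15% of the action tokens and train the model to recover them.
    action_mask = (torch.rand(B, A) < 0.15).float()
    masked_action = action.masked_fill(action_mask.bool(), 0)  # 0 used as [MASK] id here
    _, _, action_logits = model(visual, text, masked_action)
    loss = masked_prediction_loss(action_logits, action, action_mask)
    loss.backward()
    print(f"masked-action loss: {loss.item():.3f}")
```

The same masked-prediction loss can be applied to the visual and text streams in the sketch, which mirrors how a single objective can span all three modalities; the design choice of one shared backbone with per-modality heads is what lets one set of weights serve robotics, gaming, and healthcare tasks.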