17 Jun 2024 | Zane Durante, Bidipta Sarkar, Ran Gong, Rohan Taori, Yusuke Noda, Paul Tang, Ehsan Adeli, Shrinidhi Kowshika Lakshmikanth, Kevin Schulman, Arnold Milstein, Demetri Terzopoulos, Ade Famoti, Noboru Kuno, Ashley Llorens, Hoi Vo, Katsu Ikeuchi, Li Fei-Fei, Jianfeng Gao, Naoki Wake, Qiuyuan Huang
An Interactive Agent Foundation Model (IAFM) is introduced that processes text, visual, and action inputs. The model is trained with a novel multi-task paradigm that unifies diverse pre-training strategies, including visual masked autoencoders, language modeling, and next-action prediction, yielding a versatile and adaptable AI framework that performs well across domains. The model is evaluated in three domains, Robotics, Gaming AI, and Healthcare, and generates meaningful, contextually relevant outputs in each. The strength of the approach lies in its generality: it leverages a variety of data sources for effective multimodal and multi-task learning. Pre-training on 13.4 million video frames drawn from these domains enables the model to operate in interactive multimodal settings, and it generalizes across domains despite receiving domain-specific inputs. The paper also reviews related work on foundation models, multimodal understanding, and agent-based AI, then describes the model architecture, the pre-training strategy, and experiments on a range of tasks. The results show strong performance on robotics, gaming, and healthcare tasks, demonstrating effectiveness in action prediction, visual understanding, and natural-language-driven interaction. The paper concludes that the IAFM represents a promising avenue for developing generalist, action-taking, multimodal systems.
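To make the unified multi-task paradigm concrete, the following is a minimal PyTorch-style sketch of how masked visual reconstruction, language modeling, and next-action prediction could be combined into one training objective. The module names, head shapes, loss weights, and tensor layouts are illustrative assumptions for this summary, not the authors' actual implementation.

# Hypothetical sketch of a unified pre-training objective over three tasks.
# All names and dimensions below are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedPretrainObjective(nn.Module):
    """Combines masked-patch reconstruction (visual MAE), next-token language
    modeling, and next-action prediction into a single scalar loss."""

    def __init__(self, vocab_size: int = 32000, num_actions: int = 256,
                 d_model: int = 768, w_mae: float = 1.0,
                 w_lm: float = 1.0, w_act: float = 1.0):
        super().__init__()
        self.w_mae, self.w_lm, self.w_act = w_mae, w_lm, w_act
        # Task-specific heads on top of a shared backbone's hidden states.
        self.pixel_head = nn.Linear(d_model, 16 * 16 * 3)   # reconstruct masked 16x16 RGB patches
        self.lm_head = nn.Linear(d_model, vocab_size)       # predict next text token
        self.action_head = nn.Linear(d_model, num_actions)  # predict next discrete action token

    def forward(self, vis_feats, txt_feats, act_feats,
                masked_patches, next_tokens, next_actions):
        # 1) Visual masked autoencoding: regress the pixels of masked patches.
        mae_loss = F.mse_loss(self.pixel_head(vis_feats), masked_patches)
        # 2) Language modeling: cross-entropy over next text tokens.
        lm_loss = F.cross_entropy(
            self.lm_head(txt_feats).flatten(0, 1), next_tokens.flatten())
        # 3) Next-action prediction: cross-entropy over next action tokens.
        act_loss = F.cross_entropy(
            self.action_head(act_feats).flatten(0, 1), next_actions.flatten())
        # A single weighted sum is optimized jointly, so one model learns all three tasks.
        return self.w_mae * mae_loss + self.w_lm * lm_loss + self.w_act * act_loss

Under these assumptions, a training step would pass the shared backbone's visual, text, and action hidden states (shapes [B, P, d_model], [B, T, d_model], [B, A, d_model]) together with their targets and backpropagate the single combined loss; the relative weights w_mae, w_lm, and w_act are free hyperparameters in this sketch.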