14 Jun 2024 | Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
DigiRL is a novel autonomous reinforcement learning (RL) approach for training agents to control devices in real-world scenarios. Starting from a pre-trained vision-language model (VLM), it trains in two stages: offline RL that initializes the policy on existing data, followed by offline-to-online RL that fine-tunes the policy on data the agent collects itself. Training runs in a scalable, parallelizable Android learning environment equipped with a VLM-based evaluator that scores task success, which makes real-time online RL practical.

To handle the stochasticity and non-stationarity of real-world environments, DigiRL builds on advantage-weighted regression (AWR) with an automatic curriculum that steers data collection toward the tasks carrying the most learning signal.

Evaluated on tasks from the Android-in-the-Wild (AitW) dataset, the DigiRL agent achieves a 49.5% absolute improvement in success rate over supervised fine-tuning on static human demonstrations, and a 28.7% improvement over existing state-of-the-art agents such as AppAgent with GPT-4V and CogAgent trained with AitW data. It also surpasses prior autonomous RL approaches based on filtered behavior cloning. Together, these results establish DigiRL as a new state of the art for autonomous device-control agents in real-world settings.
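Conceptually, the AWR objective used here is a weighted log-likelihood: transitions with higher estimated advantage receive exponentially larger weight, so the policy imitates its own most successful behavior. Below is a minimal PyTorch-style sketch of that objective; the `policy.log_prob` interface, the temperature `beta`, and the weight clipping are illustrative assumptions, not the paper's exact implementation (which additionally drives data collection with the automatic curriculum).

```python
import torch

def awr_loss(policy, states, actions, advantages, beta=1.0, weight_clip=20.0):
    """Advantage-weighted regression: weighted log-likelihood over transitions.

    Higher-advantage actions get exponentially larger weight exp(A / beta);
    clipping the weights keeps the update numerically stable.
    """
    weights = torch.clamp(torch.exp(advantages / beta), max=weight_clip)
    log_probs = policy.log_prob(states, actions)  # hypothetical policy API
    return -(weights.detach() * log_probs).mean()
```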
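The two training stages then fit together roughly as follows. This is a sketch under stated assumptions: `rollout`, `evaluator.score`, and the `agent` interface are hypothetical names used for illustration, and the real system runs many Android emulators in parallel rather than this single-environment loop.

```python
def train(agent, env, offline_batches, evaluator, num_online_rounds):
    # Stage 1: offline RL -- initialize the policy from pre-collected trajectories.
    for batch in offline_batches:
        agent.update(batch)                   # e.g., the AWR update sketched above

    # Stage 2: offline-to-online RL -- keep training on the agent's own rollouts.
    for _ in range(num_online_rounds):
        trajectory = rollout(agent, env)       # act in the live Android environment
        reward = evaluator.score(trajectory)   # VLM-based evaluator judges success
        agent.buffer.add(trajectory, reward)
        agent.update(agent.buffer.sample())
```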