29 Jan 2019 | Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine
Soft Actor-Critic (SAC) is an off-policy deep reinforcement learning algorithm based on the maximum entropy framework: the policy is trained to maximize both expected return and entropy, which encourages exploration and improves stability. SAC targets two obstacles that make deep RL difficult to apply in practice, high sample complexity and sensitivity to hyperparameters. In particular, a constrained formulation of the maximum entropy objective tunes the temperature hyperparameter automatically during training, eliminating manual tuning and further improving training efficiency and stability.

The algorithm learns a soft Q-function and a policy network, and uses two soft Q-functions to mitigate positive bias in the policy improvement step. Evaluated on a range of continuous control benchmarks, SAC outperforms prior on-policy and off-policy methods in both sample efficiency and asymptotic performance, achieving state-of-the-art results, and it is stable across different random seeds. The same algorithm is also demonstrated on real-world robotic tasks, including quadrupedal locomotion and dexterous hand manipulation, where the robustness of policies learned through entropy maximization allows them to generalize to new environments without additional learning. Together, the results indicate that SAC is a promising approach for real-world robotics, offering both sample efficiency and stability.
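For concreteness, the maximum entropy objective and the automatic temperature adjustment mentioned above can be written as follows (a transcription in the paper's notation, where α is the temperature and H̄ is the target entropy used as the constraint level):

```latex
% Maximum entropy RL objective: expected return plus an entropy bonus
% weighted by the temperature alpha.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]

% Constrained formulation: alpha acts as a dual variable keeping the
% policy's expected entropy above a target \bar{\mathcal{H}}, which
% yields the temperature loss minimized during training:
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}
            \big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \big]
```

Minimizing J(α) by gradient descent increases α when the policy's entropy falls below the target and decreases it otherwise, which is the automatic temperature adjustment referred to above.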
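To make the moving parts concrete, below is a minimal sketch of one SAC update step in PyTorch. It is not the authors' reference implementation; the network sizes, the dummy batch, and the plain Gaussian policy (without the tanh squashing used in the paper) are simplifying assumptions. It shows the three losses the summary describes: twin soft Q-functions trained on a soft Bellman target, the entropy-regularized policy loss, and the temperature loss.

```python
# Minimal sketch of one SAC update step (assumptions: PyTorch, a plain Gaussian
# policy without tanh squashing, a random batch standing in for replay data).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

obs_dim, act_dim, gamma = 8, 2, 0.99
target_entropy = -float(act_dim)      # heuristic from the paper: -|A|

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)            # twin critics
q1_targ, q2_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)  # target critics
policy = mlp(obs_dim, 2 * act_dim)                                       # outputs mean and log_std
log_alpha = torch.zeros(1, requires_grad=True)                           # temperature (log-parameterized)

def sample_action(obs):
    mean, log_std = policy(obs).chunk(2, dim=-1)
    dist = Normal(mean, log_std.clamp(-20, 2).exp())
    action = dist.rsample()                                              # reparameterized sample
    return action, dist.log_prob(action).sum(-1, keepdim=True)

# Dummy batch standing in for replay-buffer samples.
obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
rew, next_obs, done = torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1)

alpha = log_alpha.exp()

# Critic target: soft Bellman backup using the minimum of the two target critics
# plus the entropy bonus of the next action.
with torch.no_grad():
    next_act, next_logp = sample_action(next_obs)
    q_next = torch.min(q1_targ(torch.cat([next_obs, next_act], -1)),
                       q2_targ(torch.cat([next_obs, next_act], -1)))
    target = rew + gamma * (1 - done) * (q_next - alpha * next_logp)

critic_loss = F.mse_loss(q1(torch.cat([obs, act], -1)), target) + \
              F.mse_loss(q2(torch.cat([obs, act], -1)), target)

# Actor: maximize the entropy-regularized soft Q-value (minimize its negative).
new_act, logp = sample_action(obs)
q_new = torch.min(q1(torch.cat([obs, new_act], -1)),
                  q2(torch.cat([obs, new_act], -1)))
actor_loss = (alpha.detach() * logp - q_new).mean()

# Temperature: push the policy's expected entropy toward the target entropy.
alpha_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()

print(critic_loss.item(), actor_loss.item(), alpha_loss.item())
```

In a full implementation each of these losses would be minimized with its own optimizer, and the target critics would be updated by an exponential moving average of the critic weights after every gradient step.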