30 Jan 2019 | Peter Henderson*, Riashat Islam*, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
This paper addresses the challenges of reproducibility in deep reinforcement learning (RL). It highlights how difficult results are to reproduce, owing both to non-determinism in standard benchmark environments and to the variance intrinsic to the methods themselves. The authors investigate several sources of variability, including hyperparameters, random seeds, environment characteristics, and codebases. They find that even small changes in hyperparameters can significantly affect performance, and that the choice of network architecture and activation function can also have a substantial impact. The paper further examines the effects of reward scaling and random seeds on algorithm performance, emphasizing the need for thorough reporting of experimental details to ensure reproducibility. It also discusses the importance of evaluating algorithms across a wide range of environments and of applying significance testing to assess the reliability of reported results. The authors propose guidelines and recommendations for improving reproducibility in deep RL, including the use of bootstrapping and power analysis to determine the number of trials needed for meaningful comparisons. They conclude by discussing future directions, such as building hyperparameter-agnostic algorithms and further investigating significance metrics for RL algorithms.
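To make the statistical recommendations concrete, below is a minimal sketch of how one might compare two algorithms' per-seed final returns using a percentile bootstrap confidence interval, a Welch's t-test, and a simple power analysis. The return values, the number of seeds, and the use of SciPy/statsmodels here are illustrative assumptions, not code or data from the paper.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Hypothetical final average returns, one value per random seed (illustrative only).
returns_a = np.array([3212.0, 2954.0, 3478.0, 2888.0, 3301.0])
returns_b = np.array([2790.0, 3405.0, 2611.0, 3122.0, 2983.0])

def bootstrap_ci(returns, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean return."""
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(returns, size=len(returns), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return returns.mean(), (lower, upper)

mean_a, ci_a = bootstrap_ci(returns_a)
mean_b, ci_b = bootstrap_ci(returns_b)
print(f"Algorithm A: mean={mean_a:.1f}, 95% CI={ci_a}")
print(f"Algorithm B: mean={mean_b:.1f}, 95% CI={ci_b}")

# Welch's t-test (unequal variances) as a significance check between the two seed sets.
t_stat, p_value = stats.ttest_ind(returns_a, returns_b, equal_var=False)
print(f"Welch's t-test: t={t_stat:.3f}, p={p_value:.3f}")

# Power analysis: roughly how many seeds per algorithm would be needed to detect
# the observed effect size with 80% power at alpha=0.05.
effect_size = (returns_a.mean() - returns_b.mean()) / np.sqrt(
    (returns_a.var(ddof=1) + returns_b.var(ddof=1)) / 2)  # Cohen's d with pooled variance
n_needed = TTestIndPower().solve_power(
    effect_size=abs(effect_size), alpha=0.05, power=0.8)
print(f"Seeds needed per algorithm for 80% power: {np.ceil(n_needed):.0f}")
```

With only a handful of seeds, the bootstrap intervals tend to be wide and the power analysis typically calls for many more trials than are commonly reported, which is the point the paper's guidelines emphasize.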