30 Jan 2019 | Peter Henderson*, Riashat Islam*, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
This paper addresses the challenges of reproducibility and proper experimental reporting in deep reinforcement learning (RL). The authors highlight that non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results difficult to interpret. They investigate the variability in reported metrics and results when comparing against common baselines, and they suggest guidelines to make future results in deep RL more reproducible. The paper emphasizes significance testing and standardized experimental reporting to ensure that claimed improvements over the prior state of the art are meaningful.
The authors focus on policy gradient methods for continuous control, including Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradients (DDPG), Proximal Policy Optimization (PPO), and Actor Critic using Kronecker-Factored Trust Region (ACKTR), all of which have shown promising results on continuous control tasks from OpenAI Gym. The paper examines the impact of hyperparameters, network architecture, and reward scaling on algorithm performance, showing that seemingly small changes in these settings can substantially change results, and that proper significance testing is needed to determine whether a reported improvement is meaningful.
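One of the factors the paper examines for DDPG is multiplicative reward rescaling, in which every environment reward is multiplied by a constant before being used for training. The sketch below shows one way such a scale might be applied with a Gym reward wrapper; the wrapper name, environment id, and scale values are illustrative assumptions, not the paper's code.

```python
import gym  # or `import gymnasium as gym`, depending on which package is installed


class ScaleReward(gym.RewardWrapper):
    """Multiply every reward returned by the environment by a fixed factor."""

    def __init__(self, env, scale=1.0):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Called on each reward produced by env.step().
        return self.scale * reward


# Illustrative sweep over a few orders of magnitude; each wrapped environment
# would then be handed to the learning algorithm (e.g. a DDPG implementation).
for scale in (0.1, 1.0, 10.0, 100.0):
    env = ScaleReward(gym.make("HalfCheetah-v4"), scale=scale)
```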
The authors also examine the effects of random seeds and the number of trials on reported performance, finding that two sets of runs that differ only in their random seeds can produce markedly different learning curves. They stress the need for a sufficient number of trials, and for proper significance testing, to determine whether a higher average return actually reflects better performance.
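To make the trial-count issue concrete, the following sketch averages final returns over several independent seeds and reports a standard error across those seeds. Here `train_and_evaluate` is a hypothetical placeholder standing in for a full training run of an algorithm such as TRPO or DDPG; it is not the paper's code.

```python
import numpy as np


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for one complete training run that returns the
    final average evaluation return. A real implementation would seed the
    environment, the network initialization, and the sampler, then train."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(loc=3000.0, scale=500.0))  # synthetic value so the script runs


def summarize(seeds):
    returns = np.array([train_and_evaluate(s) for s in seeds])
    mean = returns.mean()
    # Standard error across independent trials, not across timesteps of one run.
    stderr = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, stderr


# Five trials per configuration is common in the literature the paper critiques;
# the paper shows that even averages over five seeds can disagree sharply.
mean, stderr = summarize(seeds=[0, 1, 2, 3, 4])
print(f"average return: {mean:.1f} +/- {stderr:.1f} (standard error over 5 seeds)")
```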
The paper then turns to environment properties, showing that an algorithm that performs well in one environment may perform poorly in another. It emphasizes evaluating algorithms across a range of environments so that conclusions are not biased toward settings that happen to favor a particular method.
The authors further investigate the impact of codebases, finding that different open-source implementations of the same algorithm, run with otherwise identical settings, can produce noticeably different results. They emphasize using consistent codebases and reporting all implementation details to support reproducibility.
Finally, the paper discusses how evaluation metrics are reported and argues for standardized methods of assessing whether one algorithm truly outperforms another. It suggests tools such as two-sample significance tests and bootstrap confidence intervals for judging whether an observed improvement is meaningful.
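As a rough illustration of the kind of analysis the paper advocates, the sketch below compares per-seed final returns from two algorithms using Welch's two-sample t-test and a bootstrap confidence interval on the difference of means. The return values are made-up numbers, and the exact procedure is an assumption rather than a reproduction of the paper's scripts.

```python
import numpy as np
from scipy import stats


def compare_algorithms(returns_a, returns_b, n_boot=10_000, seed=0):
    """Compare final average returns (one value per random seed) from two
    algorithms using Welch's t-test and a bootstrap CI on the mean difference."""
    a = np.asarray(returns_a, dtype=float)
    b = np.asarray(returns_b, dtype=float)

    # Welch's two-sample t-test: does not assume equal variances.
    _, p_value = stats.ttest_ind(a, b, equal_var=False)

    # Bootstrap the difference of means for a 95% confidence interval.
    rng = np.random.default_rng(seed)
    diffs = [
        rng.choice(a, a.size, replace=True).mean()
        - rng.choice(b, b.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    return p_value, (ci_low, ci_high)


# Illustrative numbers only: final returns from five seeds per algorithm.
p, ci = compare_algorithms([3210, 2950, 3480, 3105, 2890],
                           [2760, 3010, 2650, 2895, 2700])
print(f"Welch p-value: {p:.3f}; 95% bootstrap CI for mean difference: {ci}")
```

If the confidence interval excludes zero and the p-value is small, the observed difference is less likely to be an artifact of seed variance alone; the paper also cautions that such tests need an adequate number of trials to have reasonable power.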
Overall, the paper emphasizes the importance of reproducibility, proper experimental reporting, and significance testing in deep RL to ensure that improvements are meaningful and that the field continues to progress.