March 1, 2024 | Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Will Dabney, Hado van Hasselt, Razvan Pascanu, James Martens
This paper explores the phenomenon of plasticity loss in neural networks, in which a network's ability to adapt to new tasks and data distributions deteriorates over the course of training. The authors identify several largely independent mechanisms contributing to plasticity loss, including preactivation distribution shift, parameter norm growth, and large target scales in regression problems. They demonstrate that interventions targeting any single mechanism are insufficient to fully prevent plasticity loss, but that combining interventions aimed at different mechanisms substantially improves the network's robustness. In particular, the combination of layer normalization and weight decay proves highly effective at maintaining plasticity across a variety of synthetic and real-world nonstationary learning tasks, including reinforcement learning in the Arcade Learning Environment. The paper also highlights the importance of target magnitude in regression tasks and the role of preactivation distribution shift in inducing similar pathologies in the empirical neural tangent kernel (NTK). The findings suggest a "Swiss cheese model" of mitigation, in which interventions targeting different mechanisms can be stacked to yield additive benefits.
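As a rough illustration of the headline intervention, the sketch below pairs layer normalization (to stabilize preactivation statistics) with weight decay (to counteract parameter norm growth) in a small regressor trained on a stream of shifting tasks. This is a minimal sketch in PyTorch under assumed settings: the architecture, the random-relabeling task stream, and all hyperparameters here are illustrative choices, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

# Small MLP with LayerNorm on the preactivations, addressing one of the
# mechanisms the paper identifies (preactivation distribution shift).
# Architecture and sizes are illustrative, not taken from the paper.
class NormalizedMLP(nn.Module):
    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),  # keeps preactivation statistics stable
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = NormalizedMLP()
# Weight decay counteracts parameter norm growth, the second mechanism.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Nonstationary stream: the regression targets are relabeled each task,
# a stand-in for the synthetic target-shift settings the paper studies.
for task in range(10):
    target_fn = nn.Linear(8, 1)  # fresh random labeling function per task
    for step in range(500):
        x = torch.randn(128, 8)
        with torch.no_grad():
            y = target_fn(x)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"task {task}: final loss {loss.item():.4f}")
```

In a setup like this, preserved plasticity would show up as roughly constant final loss from task to task, whereas a baseline without normalization and weight decay would typically fit later tasks progressively worse.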