Opening the black box of Deep Neural Networks via Information

29 Apr 2017 | Ravid Schwartz-Ziv, Naftali Tishby
This paper examines the inner workings of Deep Neural Networks (DNNs) through the lens of information theory, focusing on the Information Plane: the mutual information each layer retains about the input and about the label. The authors propose that DNNs optimize the Information Bottleneck (IB) trade-off between compression and prediction, layer by layer, and demonstrate that most training epochs are spent compressing the input representation rather than fitting the training labels. This compression phase begins once the training error is already small, when Stochastic Gradient Descent (SGD) transitions from rapid error reduction to a stochastic relaxation process.

The analysis reveals two distinct phases of SGD optimization: empirical error minimization (ERM) and representation compression. During ERM, gradients are large and consistent across mini-batches, and the layers gain mutual information about the labels. During compression, gradients are small and noisy, behaving like random diffusion, and the layers shed mutual information about the input.

Converged layers lie close to the theoretical IB bound, with encoder and decoder distributions that satisfy the IB self-consistent equations; the authors further suggest that hidden layers may converge to critical points on the IB curve, which would help explain their effectiveness. Adding hidden layers reduces training time because relaxation is faster: relaxation time grows super-linearly with the amount of compression a layer must achieve, so distributing the compression across layers yields a super-linear speedup. Overall, the results suggest that DNNs learn near-optimal representations through stochastic relaxation, and they highlight the value of information-theoretic analysis for understanding DNN training dynamics.
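The information-plane coordinates behind this analysis are, for each hidden layer T, the mutual information I(X;T) with the input (the compression axis) and I(T;Y) with the label (the prediction axis), which the paper estimates by discretizing the layer's activations. Below is a minimal sketch of such an estimate; the function names, the equal-width 30-bin discretization, and the treatment of every training sample as its own input symbol are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def mutual_information(a_labels, b_labels):
    """Plug-in estimate (in bits) of I(A;B) between two discrete sequences."""
    n = len(a_labels)
    joint, pa, pb = {}, {}, {}
    for a, b in zip(a_labels, b_labels):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pa[a] = pa.get(a, 0) + 1
        pb[b] = pb.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts
        mi += (c / n) * np.log2(c * n / (pa[a] * pb[b]))
    return mi

def information_plane_point(x_ids, y_labels, activations, n_bins=30):
    """Estimate (I(X;T), I(T;Y)) for one hidden layer T by binning its outputs.

    x_ids       : one discrete identifier per input sample (each input is its own symbol)
    y_labels    : class label per sample
    activations : (n_samples, n_units) array of the layer's outputs on those samples
    n_bins      : equal-width bins per unit (an illustrative choice)
    """
    edges = np.linspace(activations.min(), activations.max(), n_bins)
    binned = np.digitize(activations, edges)       # discretize every unit's activation
    t_states = [tuple(row) for row in binned]      # the layer's discrete state per sample
    return mutual_information(x_ids, t_states), mutual_information(t_states, y_labels)

# Hypothetical usage with activations recorded at some training epoch:
# x_ids = np.arange(len(X_train))
# i_xt, i_ty = information_plane_point(x_ids, y_train, layer_output)
```

Recording such a point for every layer at many epochs and plotting the trajectories reproduces the kind of information-plane picture the paper analyzes: movement upward (gaining I(T;Y)) during error minimization, then leftward (shrinking I(X;T)) during compression. In the paper's synthetic setting the inputs and labels are discrete, so these plug-in estimates are reasonable up to the binning choice; for continuous, high-dimensional data they would only be rough approximations.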