Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

15 Aug 2013 | Yoshua Bengio, Nicholas Léonard and Aaron Courville
This paper addresses the challenge of estimating gradients through stochastic neurons in deep learning models, particularly for conditional computation. Stochastic neurons and hard non-linearities can be useful in deep learning, but because their outputs are discrete or non-differentiable, they pose a challenge for gradient-based training. The authors examine this problem and compare four families of solutions applicable in different settings: (1) a minimum-variance unbiased gradient estimator for stochastic binary neurons, a special case of the REINFORCE algorithm; (2) decomposing the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, so that gradients can flow through the smooth part; (3) injecting additive or multiplicative noise into an otherwise differentiable computational graph; and (4) the straight-through estimator, which heuristically copies the gradient with respect to the stochastic output and uses it as the estimator of the gradient with respect to the sigmoid argument. The straight-through estimator is biased, but it has the right sign for single-layer neurons.

The paper motivates stochastic neurons through conditional computation: sparse stochastic units form a distributed representation of gates that can turn off large chunks of the computation in a neural network, which can significantly reduce the computational cost of large deep networks. Building on the decomposition approach, the authors propose a novel stochastic unit, the STS unit, which can be trained with ordinary gradient descent because it allows back-propagation through the computational graph.

Experiments show that all tested methods allow training to proceed, with a gater using noisy rectifiers yielding better results than the non-noisy baseline rectifiers.
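The REINFORCE-style unbiased estimator for a stochastic binary neuron can be illustrated with a small Monte Carlo check. This is a hedged sketch, not the paper's minimum-variance variant: for h ~ Bernoulli(sigmoid(a)), it uses the score function d log p(h|a)/da = h - sigmoid(a) with an optional constant baseline; the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reinforce_grad(a, loss_fn, n_samples=100000, baseline=0.0):
    """Monte Carlo REINFORCE estimate of dE[L(h)]/da for h ~ Bernoulli(sigmoid(a)).

    Uses the score function d log p(h|a)/da = h - sigmoid(a); `baseline`
    is a variance-reduction constant (the paper derives a better, non-constant one).
    """
    p = sigmoid(a)
    h = (rng.random(n_samples) < p).astype(float)   # sample the binary neuron
    losses = loss_fn(h)                              # loss for each sample
    return np.mean((losses - baseline) * (h - p))    # average score-function term

# Sanity check with L(h) = h: then E[L] = sigmoid(a), so the true gradient
# is sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)).
a = 0.3
est = reinforce_grad(a, lambda h: h, n_samples=200000)
true_grad = sigmoid(a) * (1.0 - sigmoid(a))
```

The estimator is unbiased, so with enough samples the Monte Carlo average approaches the analytic gradient; its variance is the practical drawback the paper's baseline construction targets.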
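The straight-through estimator is even simpler to sketch: sample a hard 0/1 output on the forward pass, then on the backward pass reuse the gradient with respect to that output as if it were the gradient with respect to the sigmoid argument. The NumPy code below is an illustrative assumption about shapes and naming, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_neuron_forward(a):
    """Forward pass: hard 0/1 sample with firing probability sigmoid(a)."""
    p = sigmoid(a)
    return (rng.random(p.shape) < p).astype(float)

def straight_through_backward(grad_h):
    """Backward pass: copy dL/dh unchanged as the estimate of dL/da,
    ignoring the non-differentiable sampling step entirely."""
    return grad_h

a = np.array([-2.0, 0.0, 3.0])          # pre-sigmoid activations
h = binary_neuron_forward(a)            # stochastic binary outputs
grad_h = np.array([0.5, -1.0, 2.0])     # pretend upstream gradient dL/dh
grad_a = straight_through_backward(grad_h)  # biased estimate of dL/da
```

In an autodiff framework the same effect is usually obtained by overriding the backward pass of the sampling operation; the estimator is biased, but as the paper notes it keeps the right sign for single-layer neurons.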
The sigmoid baseline with noise also performed better than its noiseless counterpart, suggesting that injected noise helps explore good parameter regions and fit the training objective. Straight-through units gave the best validation and test error and are very simple to implement. The paper concludes that these methods can be useful for biological models and for computational efficiency, reducing computation via conditional computation or sparse updates.
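The conditional-computation idea above, sparse gates that switch off whole chunks of the network, can be sketched as follows. The block structure and names are illustrative assumptions, not the paper's architecture: each gate controls a block of hidden units, and blocks whose gate is zero are never computed at all.

```python
import numpy as np

rng = np.random.default_rng(2)

def gated_layer(x, W_blocks, gates):
    """Conditional computation sketch: one gate per block of hidden units.

    When a gate is 0, the corresponding matrix product is skipped entirely,
    which is where the computational savings come from.
    """
    outputs = []
    for g, W in zip(gates, W_blocks):
        if g == 0:
            outputs.append(np.zeros(W.shape[1]))  # block is off: no matmul
        else:
            outputs.append(x @ W)                 # block is on: compute it
    return np.concatenate(outputs)

x = rng.standard_normal(4)
W_blocks = [rng.standard_normal((4, 3)) for _ in range(3)]
gates = np.array([1, 0, 1])  # sparse gates (fixed here; stochastic in training)
y = gated_layer(x, W_blocks, gates)
```

In training, the gates themselves would be stochastic binary units whose gradients are estimated with one of the methods above; here they are fixed only to keep the sketch deterministic.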