Mean Teacher improves semi-supervised deep learning results by averaging model weights instead of label predictions. Without changing the network architecture, it outperforms Temporal Ensembling: it achieves a 4.35% error rate on SVHN with only 250 labels, better than Temporal Ensembling trained with 1000 labels. Combining Mean Teacher with Residual Networks improves the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28% error, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.

Mean Teacher scales to large datasets and online learning, and uses unlabeled data more efficiently than the Π model. The teacher's weights are an exponential moving average (EMA) of the student's weights, and the consistency cost is defined as the expected distance between the student's and the teacher's predictions on the same (noisy) input. Because the teacher aggregates information after every training step rather than once per epoch, it provides more accurate targets, which improves test accuracy and enables training with fewer labels.

Experiments show that Mean Teacher performs better than Virtual Adversarial Training on some benchmarks, and that combining the two methods may yield even better results. The approach is compatible with various neural network architectures, including ConvNets and ResNets. Since the success of consistency regularization depends on the quality of teacher-generated targets, Mean Teacher and Virtual Adversarial Training represent two complementary ways of exploiting this principle. The method is also effective in reducing overfitting and improving model generalization.
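The two core ingredients are simple enough to sketch in a few lines: the EMA weight update θ'_t = α·θ'_{t-1} + (1−α)·θ_t and the consistency cost between student and teacher predictions. Below is a minimal PyTorch sketch of one semi-supervised training step under assumed placeholders; the toy linear model, the noise function standing in for data augmentation, and the hyperparameter values are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal Mean Teacher sketch (assumed placeholders, not the paper's code).
import copy
import torch
import torch.nn.functional as F

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # The teacher starts as a copy of the student; it is never trained by
    # backprop, only updated as an EMA of the student's weights.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    # theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(alpha).add_(sp, alpha=1.0 - alpha)

def consistency_cost(student_logits, teacher_logits):
    # Expected distance between predictions; the paper uses the mean
    # squared error between the two softmax outputs.
    return F.mse_loss(student_logits.softmax(dim=1),
                      teacher_logits.softmax(dim=1))

# --- toy usage: one training step on a partly labeled batch ---
student = torch.nn.Linear(20, 10)        # stand-in for a ConvNet/ResNet
teacher = make_teacher(student)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

x = torch.randn(32, 20)                  # batch: only the first 8 are labeled
y = torch.randint(0, 10, (8,))
noise = lambda t: t + 0.1 * torch.randn_like(t)  # stand-in for augmentation

s_logits = student(noise(x))             # student sees one noisy view
with torch.no_grad():
    t_logits = teacher(noise(x))         # teacher sees another noisy view

w = 1.0                                  # consistency weight (ramped up over training in the paper)
loss = F.cross_entropy(s_logits[:8], y) + w * consistency_cost(s_logits, t_logits)
loss.backward()
opt.step()
opt.zero_grad()
ema_update(teacher, student)             # teacher tracks the student after every step
```

Note that the classification loss uses only the labeled examples, while the consistency cost is computed over the whole batch; this is how the method extracts training signal from unlabeled data.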