17 Jun 2019 | Avital Oliver; Augustus Odena; Colin Raffel; Ekin D. Cubuk & Ian J. Goodfellow
This paper evaluates the real-world applicability of semi-supervised learning (SSL) algorithms by addressing several issues that are often overlooked in standard benchmarks. The authors argue that current evaluation methods do not accurately reflect the challenges SSL algorithms face in practical applications. To address this, they create a unified reimplementation of various widely-used SSL techniques and conduct a series of experiments designed to test their performance under realistic conditions.
Key findings include:
- When fully-supervised baselines are trained on the labeled data alone and tuned as carefully as the SSL methods, the performance gap between them is often smaller than standard benchmarks suggest.
- SSL methods differ in how sensitive they are to the amounts of labeled and unlabeled data.
- Performance can degrade significantly, even below the fully-supervised baseline, when the unlabeled dataset contains out-of-distribution examples (a construction sketched after this list).
- Realistically small validation sets make hyperparameter tuning noisy, limiting how reliably SSL methods can be tuned and compared.
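As one illustration of the class-mismatch finding, here is a minimal sketch, using NumPy, of how an unlabeled pool with a controlled fraction of out-of-distribution examples might be constructed. The helper name and signature are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def mismatched_splits(labels, in_dist_classes, n_labeled, mismatch_frac, seed=0):
    """Return (labeled, unlabeled) index arrays where `mismatch_frac`
    of the unlabeled pool is drawn from classes outside
    `in_dist_classes`. Labeled examples stay in-distribution,
    mirroring the class-mismatch setup described in the paper.
    Hypothetical helper, not the authors' code."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    in_mask = np.isin(labels, in_dist_classes)
    in_idx = rng.permutation(np.flatnonzero(in_mask))
    out_idx = rng.permutation(np.flatnonzero(~in_mask))

    labeled = in_idx[:n_labeled]
    pool = in_idx[n_labeled:]                  # in-distribution unlabeled candidates
    n_out = int(mismatch_frac * len(pool))     # how many to swap for OOD examples
    unlabeled = np.concatenate([pool[:len(pool) - n_out], out_idx[:n_out]])
    return labeled, rng.permutation(unlabeled)

# Toy usage: 10 classes, classes 0-5 in-distribution, 50% mismatch.
labels = np.random.default_rng(1).integers(0, 10, size=50_000)
lab, unlab = mismatched_splits(labels, in_dist_classes=np.arange(6),
                               n_labeled=2400, mismatch_frac=0.5)
print(len(lab), len(unlab))
```

Sweeping `mismatch_frac` from 0 to 1 with a fixed labeled set reproduces the kind of controlled degradation curve the paper examines.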
The authors recommend several improvements to SSL evaluation, including:
- Using the same underlying model for all comparisons.
- Reporting well-tuned fully-supervised and transfer learning baselines.
- Studying scenarios with class distribution mismatch.
- Varying both the amount of labeled and unlabeled data.
- Avoiding over-tuning hyperparameters on unrealistically large validation sets (see the split sketch after this list).
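To make the last recommendation concrete, here is a minimal sketch of carving the validation set out of the labeled budget itself, so that tuning respects a realistic label count. The helper name and the 10% default are assumptions for illustration, not values from the paper:

```python
import numpy as np

def budgeted_splits(n_total, n_labeled, val_frac=0.1, seed=0):
    """Return (train, val, unlabeled) index arrays where the validation
    set is taken out of the labeled budget, rather than being an extra,
    unrealistically large labeled set. Hypothetical helper, not the
    authors' released code."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    n_val = max(1, int(val_frac * n_labeled))  # validation counts against the label budget
    val = idx[:n_val]
    train = idx[n_val:n_labeled]
    unlabeled = idx[n_labeled:]                # everything else is treated as unlabeled
    return train, val, unlabeled

# Toy usage: a 1,000-label budget yields 900 training labels and a
# 100-example validation set; the remaining 49,000 points are unlabeled.
train, val, unlab = budgeted_splits(n_total=50_000, n_labeled=1000)
print(len(train), len(val), len(unlab))
```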
These findings highlight the need for more realistic evaluation methods to better guide the development of SSL algorithms for real-world applications.