15 November 2019 | Jesper E. van Engelen, Holger H. Hoos
This survey provides an overview of semi-supervised learning (SSL), a branch of machine learning that combines labelled and unlabelled data to improve learning performance. SSL lies between supervised and unsupervised learning: it exploits the large amounts of unlabelled data that are often available alongside smaller sets of labelled data. Recent research has focused on neural networks and generative learning, and a broad range of theoretical, algorithmic, and application-based work has emerged. However, no recent comprehensive survey has been published, which makes it difficult for researchers to access and organize this knowledge.

The survey aims to fill this gap with an up-to-date overview of SSL methods, covering both earlier work and recent advances. It focuses primarily on classification, where most SSL research is concentrated, and aims to give a solid understanding of the main approaches and algorithms developed over the past two decades, emphasizing the most prominent and relevant work. It proposes a new taxonomy of SSL classification algorithms that highlights the different conceptual and methodological approaches to incorporating unlabelled data. It also shows how the fundamental assumptions underlying most SSL algorithms are closely connected, and how they relate to the well-known semi-supervised clustering assumption.

The survey is structured into sections covering background, assumptions, taxonomy, inductive and transductive methods, and future prospects. It discusses the challenges and limitations of SSL, including the potential for performance degradation when unlabelled data is introduced. It also addresses the importance of empirical evaluation and the need for realistic benchmarks in assessing SSL methods. Finally, it highlights the significance of data selection and partitioning in evaluating SSL algorithms, and the importance of strong baselines when comparing SSL methods with supervised ones.
The survey concludes with a discussion of the future directions for SSL research.
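
The evaluation concerns mentioned above (how the data is partitioned, and whether an SSL method is compared against a supervised baseline trained on the same labelled points) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the survey: the data is synthetic (two Gaussian blobs), the classifier is a nearest-centroid rule, the SSL method is a very simple self-training loop without a confidence threshold, and all function names are hypothetical.

```python
import random
import statistics


def make_data(n_per_class, rng):
    """Two Gaussian blobs in 2-D; a synthetic stand-in for a real dataset."""
    data = []
    for label, (cx, cy) in enumerate([(-2.0, 0.0), (2.0, 0.0)]):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return data


def partition(data, n_labelled_per_class, n_test, rng):
    """Shuffle, then cut into disjoint labelled / unlabelled / test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    labelled, unlabelled, counts = [], [], {}
    for x, y in rest:
        if counts.get(y, 0) < n_labelled_per_class:
            labelled.append((x, y))          # keep the label
            counts[y] = counts.get(y, 0) + 1
        else:
            unlabelled.append(x)             # hide the label
    return labelled, unlabelled, test


def centroids(points):
    """Per-class mean of ((x, y), label) pairs."""
    sums = {}
    for (x, y), label in points:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda c: (x[0] - cents[c][0]) ** 2
                                    + (x[1] - cents[c][1]) ** 2)


def accuracy(cents, test):
    return sum(predict(cents, x) == y for x, y in test) / len(test)


def self_train(labelled, unlabelled, rounds=5):
    """Pseudo-label all unlabelled points, refit the centroids, repeat."""
    cents = centroids(labelled)
    for _ in range(rounds):
        pseudo = [(x, predict(cents, x)) for x in unlabelled]
        cents = centroids(labelled + pseudo)
    return cents


def evaluate(n_partitions=10, seed=0):
    """Compare baseline and self-training over several random partitions."""
    rng = random.Random(seed)
    data = make_data(100, rng)
    results = []
    for _ in range(n_partitions):
        labelled, unlabelled, test = partition(data, 2, 60, rng)
        base = accuracy(centroids(labelled), test)   # supervised baseline
        ssl = accuracy(self_train(labelled, unlabelled), test)
        results.append((base, ssl))
    return results


results = evaluate()
base_mean = statistics.mean(b for b, _ in results)
ssl_mean = statistics.mean(s for _, s in results)
print(f"supervised baseline: {base_mean:.3f}  self-training: {ssl_mean:.3f}")
```

Averaging over several partitions rather than reporting a single split is the point of the sketch: with only a handful of labelled points, the outcome can vary noticeably depending on which points happen to be labelled, and on some partitions the unlabelled data may not help at all.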