Multi-Manifold Semi-Supervised Learning (2009) | Andrew B. Goldberg, Xiaojin Zhu, Aarti Singh, Zhitong Xu, Robert Nowak
This paper presents a multi-manifold semi-supervised learning (SSL) algorithm that improves performance by leveraging unlabeled data in scenarios where data lies on multiple intersecting manifolds. The authors analyze the potential gain of using unlabeled data in such settings and propose an SSL algorithm that separates different manifolds into decision sets and performs supervised learning within each set. The algorithm uses a novel application of Hellinger distance and size-constrained spectral clustering to identify decision sets. Experiments show that the algorithm performs well on multiple intersecting, overlapping, and noisy manifolds.
The paper first reviews the theoretical analysis of SSL under the cluster assumption, which states that the target function is locally smooth over certain subsets of the feature space. It then extends this analysis to the case where data is supported on a mixture of manifolds. The complexity of the distributions is determined by the margin, defined as the minimum separation between clusters or the minimum width of a decision set. The authors show that if the margin is larger than the typical distance between data points, then SSL can improve performance over supervised learning (SL).
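In rough shorthand (our symbols, not necessarily the paper's exact notation), this condition can be written as:

```latex
% SSL helps when the margin \gamma exceeds the typical spacing of the
% m unlabeled points in the D-dimensional ambient space:
\gamma \gtrsim m^{-1/D}
```

Intuitively, a sample of $m$ unlabeled points covers the space at resolution roughly $m^{-1/D}$, so clusters or decision sets separated by a margin wider than this become detectable from unlabeled data alone.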
In the single-manifold case, the target function lies on a lower-dimensional manifold and is smooth with respect to the geodesic distance on the manifold. The algorithm uses unlabeled data to learn the geodesic distances, which reduces the dimensionality of the learning task. In the multi-manifold case, the target function is supported on multiple manifolds and can be piecewise smooth on each manifold. The algorithm must therefore resolve both the individual manifolds and the subsets of each manifold on which the target function varies smoothly.
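The geodesic-distance step can be approximated by shortest paths over a k-nearest-neighbor graph, in the style of Isomap. The sketch below is an illustration under that assumption, not the paper's exact construction; `geodesic_distances` and its parameters are our own names.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(X, k=5):
    """Approximate geodesic distances on a manifold by running
    shortest paths over a k-nearest-neighbor graph (Isomap-style
    sketch; not the paper's exact construction)."""
    D = cdist(X, X)                          # pairwise Euclidean distances
    n = len(X)
    G = np.full((n, n), np.inf)              # inf marks "no edge" for csgraph
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1] # k nearest neighbors, skipping self
    for i in range(n):
        G[i, nbrs[i]] = D[i, nbrs[i]]
    G = np.minimum(G, G.T)                   # symmetrize the graph
    return shortest_path(G, method="D", directed=False)

# Points on a quarter circle: the geodesic (arc) distance between the
# endpoints exceeds the straight-line chord through the ambient space.
t = np.linspace(0, np.pi / 2, 50)
X = np.c_[np.cos(t), np.sin(t)]
Dg = geodesic_distances(X, k=3)
```

With enough unlabeled points, `Dg[0, -1]` approaches the true arc length π/2 ≈ 1.571, while the Euclidean distance between the endpoints is only √2 ≈ 1.414, which is why graph distances capture smoothness along the manifold better than ambient distances.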
The authors propose a "cluster-then-label" SSL algorithm that uses unlabeled data to form decision sets, then trains a supervised learner on each decision set. The algorithm uses Hellinger distance to detect overlapping clusters and intersecting manifolds, and size-constrained spectral clustering to ensure each cluster has enough labeled and unlabeled points. The algorithm is tested on synthetic and real data sets, showing that it outperforms supervised learning in both regression and classification tasks. The results indicate that SSL can improve performance when the number of unlabeled data points is large enough to resolve the decision sets.
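The Hellinger distance used to detect overlapping or intersecting manifolds compares locally estimated densities around neighboring points; points on different manifolds have dissimilar local densities even when they are close in Euclidean distance. A minimal sketch of the distance itself on discrete distributions (the paper applies it to local density estimates):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions:
    H(p, q) = (1/sqrt(2)) * || sqrt(p) - sqrt(q) ||_2.
    Ranges from 0 (identical) to 1 (disjoint support)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                      # normalize to proper distributions
    q = q / q.sum()
    return np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))
```

For example, `hellinger([1, 0], [0, 1])` is 1.0 (disjoint support), while identical histograms give 0; distances near 1 between neighboring local densities signal a manifold boundary or intersection.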
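The cluster-then-label pipeline can be sketched as follows. This is a simplified illustration: it uses off-the-shelf spectral clustering and a nearest-neighbor learner in place of the paper's Hellinger-weighted, size-constrained clustering, and the function and parameter names are our own.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import KNeighborsClassifier

def cluster_then_label(X_lab, y_lab, X_unlab, X_test, n_clusters=2):
    """Simplified cluster-then-label: cluster labeled + unlabeled + test
    points into decision sets, then fit a supervised learner per cluster
    using only the labeled points that fall in it. (A sketch; the paper
    additionally uses Hellinger-distance edge weights and
    size-constrained spectral clustering.)"""
    X_all = np.vstack([X_lab, X_unlab, X_test])
    clusters = SpectralClustering(n_clusters=n_clusters,
                                  affinity="nearest_neighbors",
                                  n_neighbors=10,
                                  random_state=0).fit_predict(X_all)
    lab_c = clusters[:len(X_lab)]
    test_c = clusters[len(X_lab) + len(X_unlab):]
    y_pred = np.empty(len(X_test), dtype=y_lab.dtype)
    for c in range(n_clusters):
        in_test = test_c == c
        if not in_test.any():
            continue
        mask = lab_c == c
        # Fall back to all labeled data if a cluster contains no labels.
        X_fit, y_fit = (X_lab[mask], y_lab[mask]) if mask.any() else (X_lab, y_lab)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_fit, y_fit)
        y_pred[in_test] = clf.predict(X_test[in_test])
    return y_pred

# Two well-separated blobs, one labeled point each: unlabeled data
# resolves the decision sets, and each set is labeled by its one label.
rng = np.random.default_rng(0)
A = rng.normal([0, 0], 0.1, (60, 2))
B = rng.normal([5, 5], 0.1, (60, 2))
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unlab = np.vstack([A[:50], B[:50]])
X_test = np.vstack([A[50:], B[50:]])
pred = cluster_then_label(X_lab, y_lab, X_unlab, X_test)
```

The toy example mirrors the paper's point: with only one labeled point per class, a purely supervised learner has almost nothing to go on, but the unlabeled points make the two decision sets easy to separate.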