Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning
News Release Summary
Researchers at the University of Virginia have found a way to breathe new life into a decades-old machine learning technique called pseudo-labeling, which had largely been abandoned in favor of newer approaches. The core challenge in semi-supervised learning is making the most of situations where only a small fraction of the training data carries human-assigned labels while the rest remains unlabeled, a common problem in computer vision because labeling is expensive. The team's method, called Curriculum Labeling, first trains a model on the small labeled dataset, then assigns predicted labels to unlabeled images in stages, starting with only the predictions the model is most confident about and incorporating harder, less certain examples over successive rounds. Two design choices proved critical. First, the threshold that decides which unlabeled samples to include at each stage is derived from Extreme Value Theory rather than from fixed, hand-tuned cutoffs. Second, the model's parameters are completely reset before each new training round instead of being fine-tuned further, which prevents the model from reinforcing its own early mistakes. On standard image classification benchmarks, the approach reached 94.91% accuracy on CIFAR-10 using just 4,000 labeled images and 68.87% top-1 accuracy on ImageNet using only 10% of the labeled data, results that are competitive with leading methods. The researchers also showed that the method holds up better than most alternatives when the unlabeled data contains images from categories not present in the labeled set, a more realistic scenario than the clean splits typically used in academic evaluations. The work suggests that self-training approaches were not inherently flawed but needed more careful implementation.
Abstract
In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by applying pseudo-labels to samples in the unlabeled set by using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and iteratively repeating this process in a self-training cycle. Current methods seem to have abandoned this approach in favor of consistency regularization methods that train models under a combination of different styles of self-supervised losses on the unlabeled samples and standard supervised losses on the labeled samples. We empirically demonstrate that pseudo-labeling can in fact be competitive with the state-of-the-art, while being more resilient to out-of-distribution samples in the unlabeled set. We identify two key factors that allow pseudo-labeling to achieve such remarkable results: (1) applying curriculum learning principles and (2) avoiding concept drift by restarting model parameters before each self-training cycle. We obtain 94.91% accuracy on CIFAR-10 using only 4,000 labeled samples, and 68.87% top-1 accuracy on ImageNet-ILSVRC using only 10% of the labeled samples. The code is available at https://github.com/uvavision/Curriculum-Labeling
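The self-training cycle described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It rests on several assumptions made only for illustration: a toy nearest-centroid classifier on 2-D points stands in for the image model, a plain nearest-rank percentile over the model's confidences stands in for the paper's threshold, and the schedule that lowers the percentile from 80 to 0 in steps of 20 is a guess. All function names (`train_from_scratch`, `predict`, `percentile`, `curriculum_labeling`) are hypothetical.

```python
import math
import random


def train_from_scratch(samples):
    """Fit a fresh nearest-centroid model; nothing is carried over between calls."""
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}


def predict(model, point):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    labels = sorted(model)
    scores = [-math.dist(point, model[label]) for label in labels]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    best = max(range(len(labels)), key=weights.__getitem__)
    return labels[best], weights[best] / sum(weights)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with integer q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def curriculum_labeling(labeled, unlabeled, step=20):
    """Self-training with a confidence threshold that loosens every cycle."""
    model = train_from_scratch(labeled)
    for q in range(100 - step, -1, -step):  # assumed schedule: 80, 60, 40, 20, 0
        predictions = [predict(model, p) for p in unlabeled]
        threshold = percentile([conf for _, conf in predictions], q)
        # Keep only pseudo-labels at or above the current confidence percentile.
        pseudo = [(p, label) for p, (label, conf) in zip(unlabeled, predictions)
                  if conf >= threshold]
        # Restart: fit a new model instead of updating the previous one.
        model = train_from_scratch(labeled + pseudo)
    return model


# Toy data (an assumption): two Gaussian clusters, three labeled points per class.
rng = random.Random(0)


def cluster(center, n):
    return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0)) for _ in range(n)]


labeled = [(p, 0) for p in cluster(0.0, 3)] + [(p, 1) for p in cluster(5.0, 3)]
unlabeled = cluster(0.0, 100) + cluster(5.0, 100)
model = curriculum_labeling(labeled, unlabeled)
```

In this sketch the pseudo-labeled set is rebuilt from the full unlabeled pool in every cycle, and the model is refit from nothing each time, so an early wrong pseudo-label can be dropped later rather than being baked into the weights. With a real network, the restart would correspond to re-initializing the parameters before each training round.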
Citation
@inproceedings{cascantebonilla2021curriculum,
title = {Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning},
author = {Cascante-Bonilla, Paola and Tan, Fuwen and Qi, Yanjun and Ordonez, Vicente},
year = {2021},
booktitle = {The Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI 2021},
url = {https://arxiv.org/abs/2001.06001},
}