• When facing a limited amount of labeled data for supervised learning tasks, four approaches are commonly discussed.
    1. Pre-training + fine-tuning: Pre-train a powerful task-agnostic model on a large unsupervised data corpus, e.g. pre-training LMs on free text, or pre-training vision models on unlabeled images via self-supervised learning, and then fine-tune it on the downstream task with a small set of labeled samples.
    2. Semi-supervised learning: Learn from the labeled and unlabeled samples together. A lot of research within this approach has focused on vision tasks.
    3. Active learning: Labeling is expensive, but we still want to collect more labels given a cost budget. Active learning learns to select the most valuable unlabeled samples to be labeled next and helps us act smartly with a limited budget.
    4. Pre-training + dataset auto-generation: Given a capable pre-trained model, we can utilize it to auto-generate many more labeled samples. This has been especially popular in the language domain, driven by the success of few-shot learning.
  • Semi-supervised learning uses both labeled and unlabeled data to train a model. Interestingly, most existing literature on semi-supervised learning focuses on vision tasks, while pre-training + fine-tuning is the more common paradigm for language tasks.
  • All the methods introduced in this post have a loss combining two parts: $L = L_s + \mu(t) L_u$. The supervised loss $L_s$ is easy to compute given the labeled examples, so we will focus on how the unsupervised loss $L_u$ is designed. A common choice of the weighting term $\mu(t)$ is a ramp function increasing the importance of $L_u$ over time, where $t$ is the training step.
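The ramp-up weight can be sketched in a few lines; the sigmoid-shaped schedule below follows the common $\exp(-5(1 - t/T)^2)$ form, with `ramp_length` and `max_weight` as illustrative hyperparameters rather than values from any specific paper:

```python
import numpy as np

def ramp_up_weight(t, ramp_length=80, max_weight=30.0):
    """Sigmoid-shaped ramp-up for the unsupervised loss weight mu(t).

    Rises smoothly from near 0 to max_weight over the first ramp_length
    steps, then stays constant, so L_u only dominates once the model has
    learned something from the labeled data.
    """
    if t >= ramp_length:
        return max_weight
    phase = 1.0 - t / ramp_length
    return max_weight * float(np.exp(-5.0 * phase ** 2))
```

Early in training the unsupervised term barely contributes; by step `ramp_length` it reaches full weight.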
  • Several hypotheses have been discussed in the literature to support certain design decisions in semi-supervised learning methods.
    • H1: Smoothness Assumptions: If two data samples are close in a high-density region of the feature space, their labels should be the same or very similar.
    • H2: Cluster Assumptions: The feature space has both dense regions and sparse regions. Densely grouped data points naturally form a cluster, and samples in the same cluster are expected to have the same label. This is a small extension of H1.
    • H3: Low-density Separation Assumptions: The decision boundary between classes tends to be located in the sparse, low-density regions, because otherwise the decision boundary would cut a high-density cluster into two classes, corresponding to two clusters, which would invalidate H1 and H2.
    • H4: Manifold Assumptions: High-dimensional data tends to lie on a low-dimensional manifold. Even though real-world data might be observed in very high dimensions (e.g. images of real-world objects/scenes), it can actually be captured by a lower-dimensional manifold where certain attributes are captured and similar points are grouped closely (e.g. images of real-world objects/scenes are not drawn from a uniform distribution over all pixel combinations). This enables us to learn a more efficient representation with which to discover and measure similarity between unlabeled data points. This is also the foundation for representation learning.
  • Consistency Regularization, also known as Consistency Training, assumes that randomness within the neural network (e.g. with Dropout) or data augmentation transformations should not modify model predictions given the same input. Every method in this section has a consistency regularization loss as Lu. This idea has been adopted in several self-supervised learning methods, such as SimCLR, BYOL, SimCSE, etc. Different augmented versions of the same sample should result in the same representation. Cross-view training in language modeling and multi-view learning in self-supervised learning all share the same motivation.
  • Π-Model. Fig. 1. Overview of the Π-model. Two versions of the same input with different stochastic augmentation and dropout masks pass through the network and the outputs are expected to be consistent. (Image source: Laine & Aila (2017)) Sajjadi et al. (2016) proposed an unsupervised learning loss to minimize the difference between two passes through the network with stochastic transformations (e.g. dropout, random max-pooling) for the same data point. The label is not explicitly used, so the loss can be applied to an unlabeled dataset. Laine & Aila (2017) later coined the name Π-Model for such a setup.
  • Temporal ensembling. Fig. 2. Overview of Temporal Ensembling. The per-sample EMA label prediction is the learning target. (Image source: Laine & Aila (2017)) The Π-model requires the network to run two passes per sample, doubling the computation cost. To reduce the cost, Temporal Ensembling (Laine & Aila 2017) maintains an exponential moving average (EMA) of the model prediction per training sample as the learning target, which is only evaluated and updated once per epoch: $Z_i^{(t)} = \alpha Z_i^{(t-1)} + (1-\alpha) z_i^{(t)}, \quad \tilde{z}_i^{(t)} = Z_i^{(t)} / (1 - \alpha^t)$, where $\tilde{z}_i^{(t)}$ is the ensemble target at epoch $t$ and $z_i^{(t)}$ is the model prediction in the current round. Because the ensemble output $Z_i$ is initialized to 0, it is normalized by $1 - \alpha^t$ to correct this startup bias; the Adam optimizer has such bias correction terms for the same reason. Note that since $Z_i^{(0)} = 0$, with correction, $\tilde{z}_i^{(1)}$ is simply equivalent to $z_i^{(1)}$ at epoch 1.
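The EMA accumulation with startup-bias correction can be sketched as follows (numpy only; `update_ensemble_target` is a hypothetical helper name):

```python
import numpy as np

def update_ensemble_target(Z, z, t, alpha=0.6):
    """One Temporal Ensembling step for a single sample.

    Z: running ensemble of predictions (initialized to zeros), shape (C,)
    z: current-epoch model prediction, shape (C,)
    t: 1-based epoch index
    Returns the updated ensemble Z and the bias-corrected target z_tilde.
    """
    Z = alpha * Z + (1 - alpha) * z
    z_tilde = Z / (1 - alpha ** t)  # startup-bias correction, as in Adam
    return Z, z_tilde
```

With the correction, the target at epoch 1 is exactly the epoch-1 prediction, as noted in the text.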
  • Mean teachers. Fig. 3. Overview of the Mean Teacher framework. (Image source: Tarvainen & Valpola, 2017) Temporal Ensembling keeps track of an EMA of label predictions for each training sample as a learning target. However, this label prediction only changes once per epoch, making the approach clumsy when the training dataset is large. Mean Teacher (Tarvainen & Valpola, 2017) was proposed to overcome the slowness of target updates by tracking the moving average of model weights instead of model outputs.
  • The consistency regularization loss is the distance between the student's and teacher's predictions, and the student-teacher gap should be minimized. The mean teacher is expected to provide more accurate predictions than the student, which was confirmed empirically, as shown in Fig. 4. Fig. 4. Classification error on SVHN of Mean Teacher and the Π Model. The mean teacher (in orange) has better performance than the student model (in blue). (Image source: Tarvainen & Valpola, 2017)
  • According to their ablation studies:
    • Input augmentation (e.g. random flips of input images, Gaussian noise) or student model dropout is necessary for good performance. Dropout is not needed on the teacher model.
    • The performance is sensitive to the EMA decay hyperparameter β. A good strategy is to use a small β=0.99 during the ramp-up stage and a larger β=0.999 in the later stage when the student model improvement slows down.
    • They found that MSE as the consistency cost function performs better than other cost functions like KL divergence.
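The teacher weight update itself is a one-liner; a minimal sketch with parameters represented as plain Python lists, standing in for iterating over a real framework's `model.parameters()` after each optimizer step:

```python
def ema_update(teacher, student, beta=0.999):
    """Mean Teacher update: theta' <- beta * theta' + (1 - beta) * theta.

    teacher, student: lists of parameter values (floats or arrays).
    Run after every training step, so the teacher is an exponential
    moving average of student weights over time.
    """
    return [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
```

Because the update runs per step rather than per epoch, the teacher's targets evolve much faster than Temporal Ensembling's.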
  • Noisy samples as learning targets. Several recent consistency training methods learn to minimize the prediction difference between the original unlabeled sample and its corresponding augmented version. The setup is quite similar to the Π-model, but the consistency regularization loss is only applied to the unlabeled data. Fig. 5. Consistency training with noisy samples. Adversarial Training (Goodfellow et al. 2014) applies adversarial noise onto the input and trains the model to be robust to such adversarial attacks.
  • Virtual Adversarial Training (VAT; Miyato et al. 2018) extends the idea to work in semi-supervised learning. Because the true label distribution $q(y \mid x_l)$ is unknown, VAT replaces it with the current model prediction for the original input with the current weights $\hat\theta$. Note that $\hat\theta$ is a fixed copy of the model weights, so there is no gradient update on $\hat\theta$: $\mathcal{L}_u^{\text{VAT}}(x, \theta) = D[p_{\hat\theta}(y \mid x), p_\theta(y \mid x + r_{\text{vadv}})]$ where $r_{\text{vadv}} = \arg\max_{r; \|r\| \leq \epsilon} D[p_{\hat\theta}(y \mid x), p_\theta(y \mid x + r)]$. The VAT loss applies to both labeled and unlabeled samples. It is a negative smoothness measure of the current model's prediction manifold at each data point, so optimizing this loss encourages the manifold to be smoother.
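A rough sketch of how $r_{\text{vadv}}$ can be approximated with one step of power iteration. The paper uses backpropagated gradients; this dependency-free version estimates them with finite differences, so treat it as illustrative only (hypothetical helper names, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def vat_perturbation(x, predict_fn, epsilon=1.0, xi=0.1, n_power=1, delta=1e-4):
    """Approximate r_vadv = argmax_{||r|| <= eps} KL[p(y|x) || p(y|x+r)].

    Starts from a random unit direction d and refines it by (numerically)
    differentiating the KL at the slightly perturbed point x + xi*d,
    which is one step of power iteration on the local Hessian.
    """
    p = predict_fn(x)
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    for _ in range(n_power):
        grad = np.zeros_like(x)
        base = kl(p, predict_fn(x + xi * d))
        for i in range(x.size):
            dp = d.copy()
            dp.flat[i] += delta
            grad.flat[i] = (kl(p, predict_fn(x + xi * dp)) - base) / delta
        d = grad / (np.linalg.norm(grad) + 1e-12)
    return epsilon * d  # the virtual adversarial perturbation
```

The returned perturbation has norm epsilon and points in (approximately) the direction that most increases the KL divergence.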
  • Interpolation Consistency Training (ICT; Verma et al. 2019) enhances the dataset by adding more interpolations of data points and expects the model prediction to be consistent with interpolations of the corresponding labels. The MixUp (Zhang et al. 2018) operation mixes two images via a simple weighted sum and combines it with label smoothing. Following the idea of MixUp, ICT expects the prediction model to produce a label on a mixup sample matching the interpolation of the predictions on the corresponding inputs: $\text{mixup}_\lambda(x_i, x_j) = \lambda x_i + (1-\lambda) x_j$ and $p_\theta(y \mid \text{mixup}_\lambda(x_i, x_j)) \approx \lambda p_{\theta'}(y \mid x_i) + (1-\lambda) p_{\theta'}(y \mid x_j)$, where $\theta'$ is a moving average of $\theta$, i.e. a mean teacher. Fig. 6. Overview of Interpolation Consistency Training. MixUp is applied to produce more interpolated samples with interpolated labels as learning targets. (Image source: Verma et al. 2019)
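A minimal sketch of the ICT target construction, assuming the teacher predictions `p_i`, `p_j` are given (hypothetical helper names):

```python
import numpy as np

def mixup(a, b, lam):
    """MixUp of two inputs (or two probability vectors)."""
    return lam * np.asarray(a) + (1 - lam) * np.asarray(b)

def ict_consistency_loss(pred_on_mix, p_i, p_j, lam):
    """Squared error between the student prediction on the mixed input
    and the interpolation of the mean-teacher predictions on x_i, x_j."""
    target = mixup(p_i, p_j, lam)
    return float(np.mean((np.asarray(pred_on_mix) - target) ** 2))
```

The loss is zero exactly when the student's prediction on the mixed sample equals the interpolated teacher targets.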
  • Similar to VAT, Unsupervised Data Augmentation (UDA; Xie et al. 2020) learns to predict the same output for an unlabeled example and the augmented one. UDA especially focuses on studying how the “quality” of noise can impact the semi-supervised learning performance with consistency training. It is crucial to use advanced data augmentation methods for producing meaningful and effective noisy samples. Good data augmentation should produce valid (i.e. does not change the label) and diverse noise, and carry targeted inductive biases.
  • For images, UDA adopts RandAugment (Cubuk et al. 2019), which uniformly samples augmentation operations available in PIL. No learning or optimization is involved, so it is much cheaper than AutoAugment.
  • For language, UDA combines back-translation and TF-IDF based word replacement. Back-translation preserves the high-level meaning but may not retain certain words, while TF-IDF based word replacement drops uninformative words with low TF-IDF scores. In the experiments on language tasks, they found UDA to be complementary to transfer learning and representation learning; for example, BERT fine-tuned (i.e. BERTFINETUNE in Fig. 8) on in-domain unlabeled data can further improve the performance.
  • When calculating $L_u$, UDA found three training techniques to help improve the results.
    • Low confidence masking: Mask out examples whose prediction confidence is lower than a threshold τ.
    • Sharpening prediction distribution: Use a low temperature T in softmax to sharpen the predicted probability distribution.
    • In-domain data filtration: To extract more in-domain data from a large out-of-domain dataset, they trained a classifier to predict in-domain labels and then retained samples with high-confidence predictions as in-domain candidates.
    Precisely, $\mathcal{L}_u^{\text{UDA}} = \mathbb{1}[\max_{y'} p_{\hat\theta}(y' \mid x) > \tau] \cdot D[p_{\hat\theta}^{\text{(sharp)}}(y \mid x; T), p_\theta(y \mid \bar{x})]$, where $p_{\hat\theta}^{\text{(sharp)}}(y \mid x; T) = \frac{\exp(z(y)/T)}{\sum_{y'} \exp(z(y')/T)}$ and $z(\cdot)$ are the logits. Here $\hat\theta$ is a fixed copy of the model weights, same as in VAT, so there is no gradient update; $\bar{x}$ is the augmented data point; τ is the prediction confidence threshold and T is the distribution sharpening temperature.
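The masking-plus-sharpening recipe for a single sample can be sketched as below; the `tau` and `T` values are illustrative defaults, not the paper's tuned settings:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def uda_unsup_loss(logits_orig, probs_aug, tau=0.8, T=0.4, eps=1e-12):
    """UDA unsupervised term for one sample.

    logits_orig: fixed-copy model logits on the original x (no gradient).
    probs_aug:   model probabilities on the augmented x_bar.
    Low-confidence samples are masked out; confident ones contribute a
    cross entropy against the temperature-sharpened target distribution.
    """
    if softmax(logits_orig).max() <= tau:
        return 0.0                              # low-confidence mask
    target = softmax(logits_orig, T=T)          # sharpened target
    return float(-np.sum(target * np.log(np.asarray(probs_aug) + eps)))
```

Confident predictions yield a positive consistency loss; uncertain ones are skipped entirely.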
  • Pseudo Labeling (Lee 2013) assigns fake labels to unlabeled samples based on the maximum softmax probabilities predicted by the current model and then trains the model on both labeled and unlabeled samples simultaneously in a purely supervised setup. Why could pseudo labels work? Pseudo labeling is in effect equivalent to Entropy Regularization (Grandvalet & Bengio 2004), which minimizes the conditional entropy of class probabilities for unlabeled data to favor low-density separation between classes. In other words, the predicted class probabilities are in fact a measure of class overlap, so minimizing the entropy reduces class overlap and thus encourages low-density separation. Fig. 9. t-SNE visualization of outputs on the MNIST test set by models trained (a) without and (b) with pseudo labeling on 60000 unlabeled samples, in addition to 600 labeled samples. Pseudo labeling leads to better segregation in the learned embedding space. (Image source: Lee 2013) Training with pseudo labeling naturally becomes an iterative process. We refer to the model that produces pseudo labels as the teacher and the model that learns with pseudo labels as the student.
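A minimal sketch of hard pseudo-label assignment; with `threshold=0` every sample gets a label, as in Lee (2013), while a positive threshold mimics the confidence filtering used by later variants:

```python
import numpy as np

def pseudo_label(probs, threshold=0.0):
    """Hard pseudo labels from the argmax of predicted probabilities.

    probs: (N, C) class probabilities for N unlabeled samples.
    Returns (indices_kept, labels) for samples whose max probability
    passes the confidence threshold.
    """
    probs = np.asarray(probs)
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)
```

The kept (sample, label) pairs are then mixed into the supervised training set for the next round.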
  • Label propagation. Label Propagation (Iscen et al. 2019) constructs a similarity graph among samples based on feature embeddings. Pseudo labels are then “diffused” from known samples to unlabeled ones, with propagation weights proportional to pairwise similarity scores in the graph. Conceptually it is similar to a k-NN classifier, and both suffer from the problem of not scaling well to large datasets. Fig. 10. Illustration of how Label Propagation works. (Image source: Iscen et al. 2019)
  • Self-Training. Self-Training is not a new concept (Scudder 1965, Nigam & Ghani CIKM 2000). It is an iterative algorithm, alternating between the following two steps until every unlabeled sample has a label assigned: • Initially it builds a classifier on the labeled data. • Then it uses this classifier to predict labels for the unlabeled data and converts the most confident predictions into labeled samples.
  • Xie et al. (2020) applied self-training in deep learning and achieved great results. On the ImageNet classification task, they first trained an EfficientNet (Tan & Le 2019) model as teacher to generate pseudo labels for 300M unlabeled images and then trained a larger EfficientNet as student to learn with both true labeled and pseudo labeled images. One critical element in their setup is to have noise during student model training but have no noise for the teacher to produce pseudo labels. Thus their method is called Noisy Student. They applied stochastic depth (Huang et al. 2016), dropout and RandAugment to noise the student. Noise is important for the student to perform better than the teacher. The added noise has a compound effect to encourage the model’s decision making frontier to be smooth, on both labeled and unlabeled data.
  • A few other important technical configs in Noisy Student self-training are:
    • The student model should be sufficiently large (i.e. larger than the teacher) to fit more data.
    • Noisy Student should be paired with data balancing; it is especially important to balance the number of pseudo-labeled images in each class.
    • Soft pseudo labels work better than hard ones.
    Noisy Student also improves adversarial robustness against an FGSM (Fast Gradient Sign Method; the attack uses the gradient of the loss w.r.t. the input data and adjusts the input to maximize the loss) attack, even though the model is not optimized for adversarial robustness. SentAugment, proposed by Du et al. (2020), aims to solve the problem of not having enough in-domain unlabeled data for self-training in the language domain. It relies on sentence embeddings to find unlabeled in-domain samples from a large corpus and uses the retrieved sentences for self-training.
  • Reducing confirmation bias. Confirmation bias is a problem with incorrect pseudo labels provided by an imperfect teacher model: overfitting to wrong labels may not give us a better student model. To reduce confirmation bias, Arazo et al. (2019) proposed two techniques. One is to adopt MixUp with soft labels. Given two samples $(x_i, x_j)$ and their corresponding true or pseudo labels $(y_i, y_j)$, the interpolated label equation can be translated to a cross entropy loss with softmax outputs: $\bar{x} = \lambda x_i + (1-\lambda) x_j, \; \bar{y} = \lambda y_i + (1-\lambda) y_j \;\Leftrightarrow\; \mathcal{L} = -\lambda [y_i^\top \log f_\theta(\bar{x})] - (1-\lambda) [y_j^\top \log f_\theta(\bar{x})]$. MixUp is insufficient if there are too few labeled samples, so they further set a minimum number of labeled samples in every mini-batch by oversampling the labeled samples. This works better than upweighting labeled samples, because it leads to more frequent updates rather than fewer updates of larger magnitude, which could be less stable. Like consistency regularization, data augmentation and dropout are also important for pseudo labeling to work well.
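The soft-label MixUp loss can be sketched directly; `predict_probs` stands in for softmax(f_θ(·)) and is an assumption of this sketch:

```python
import numpy as np

def mixup_soft_label_loss(x_i, x_j, y_i, y_j, lam, predict_probs, eps=1e-12):
    """MixUp with soft labels: interpolate the inputs, then take the
    lambda-weighted cross entropy against each (true or pseudo) label,
    i.e. L = -lam * y_i^T log p - (1 - lam) * y_j^T log p at x_bar."""
    x_bar = lam * np.asarray(x_i) + (1 - lam) * np.asarray(x_j)
    p = np.asarray(predict_probs(x_bar))
    ce = lambda y: -float(np.sum(np.asarray(y) * np.log(p + eps)))
    return lam * ce(y_i) + (1 - lam) * ce(y_j)
```

Note the targets are never mixed into a single hard label; the loss keeps both labels with their interpolation weights.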
  • Meta Pseudo Labels (Pham et al. 2021) adapts the teacher model constantly with the feedback of how well the student performs on the labeled dataset. The teacher and the student are trained in parallel, where the teacher learns to generate better pseudo labels and the student learns from the pseudo labels.
  • Pseudo Labeling with Consistency Regularization. It is possible to combine the above two approaches together, running semi-supervised learning with both pseudo labeling and consistency training.
  • MixMatch. MixMatch (Berthelot et al. 2019), as a holistic approach to semi-supervised learning, utilizes unlabeled data by merging the following techniques:
    1. Consistency regularization: Encourage the model to output the same predictions on perturbed unlabeled samples.
    2. Entropy minimization: Encourage the model to output confident predictions on unlabeled data.
    3. MixUp augmentation: Encourage the model to have linear behavior between samples. Given a batch of labeled data X and unlabeled data U, we create augmented versions of them via MixMatch(.), X̄ and Ū, containing augmented samples and guessed labels for the unlabeled examples.
  • For each $u$, MixMatch generates $K$ augmentations $\bar{u}^{(k)} = \text{Augment}(u)$ for $k = 1, \dots, K$, and the pseudo label is guessed based on the average: $\hat{y} = \frac{1}{K} \sum_{k=1}^K p_\theta(y \mid \bar{u}^{(k)})$. Fig. 12. The process of “label guessing” in MixMatch: averaging K augmentations, correcting the predicted marginal distribution and finally sharpening the distribution. (Image source: Berthelot et al. 2019) According to their ablation studies, it is critical to have MixUp, especially on the unlabeled data. Removing temperature sharpening on the pseudo label distribution hurts the performance quite a lot. Averaging over multiple augmentations for label guessing is also necessary.
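Label guessing is only a few lines; a minimal numpy sketch with a hypothetical `guess_label` helper:

```python
import numpy as np

def guess_label(aug_probs, T=0.5):
    """MixMatch label guessing: average predictions over K augmentations
    of the same unlabeled sample, then sharpen with temperature T.

    aug_probs: (K, C) predicted class probabilities, one row per augmentation.
    """
    q = np.mean(np.asarray(aug_probs), axis=0)  # average over K augmentations
    q = q ** (1.0 / T)                          # temperature sharpening
    return q / q.sum()                          # renormalize to a distribution
```

Sharpening (T < 1) pushes the averaged distribution toward its dominant class, implementing the entropy-minimization ingredient.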
  • ReMixMatch (Berthelot et al. 2020) improves MixMatch by introducing two new mechanisms: Fig. 13. Illustration of two improvements introduced in ReMixMatch over MixMatch. (Image source: Berthelot et al. 2020)
    • Distribution alignment. It encourages the marginal distribution of predictions to be close to the marginal distribution of the ground-truth labels. Let $p(y)$ be the class distribution in the true labels and $\tilde{p}(\hat{y})$ be a running average of the predicted class distribution on the unlabeled data. The model prediction on an unlabeled sample is normalized as $\text{Normalize}\big(p_\theta(y \mid u) \cdot \frac{p(y)}{\tilde{p}(\hat{y})}\big)$ to match the true marginal distribution.
    • Note that entropy minimization is not a useful objective if the marginal distribution is not uniform.
    • I do feel the assumption that the class distributions on the labeled and unlabeled data should match is too strong and not necessarily true in a real-world setting.
  • Augmentation anchoring. Given an unlabeled sample, it first generates an “anchor” version with weak augmentation and then averages K strongly augmented versions using CTAugment (Control Theory Augment). CTAugment only samples augmentations that keep the model predictions within the network tolerance. The ReMixMatch loss is a combination of several terms:
    • a supervised loss with data augmentation and MixUp applied;
    • an unsupervised loss with data augmentation and MixUp applied, using pseudo labels as targets;
    • a CE loss on a single heavily-augmented unlabeled image without MixUp;
    • a rotation loss as in self-supervised learning.
  • DivideMix. DivideMix (Junnan Li et al. 2020) combines semi-supervised learning with Learning with Noisy Labels (LNL). It models the per-sample loss distribution via a GMM to dynamically divide the training data into a labeled set with clean examples and an unlabeled set with noisy ones. Following the idea in Arazo et al. 2019, they fit a two-component GMM on the per-sample cross entropy loss $\ell_i = -y_i^\top \log f_\theta(x_i)$. Clean samples are expected to reach lower loss faster than noisy samples. The component with the smaller mean corresponds to clean labels; let's denote it as $c$. If the GMM posterior probability $w_i = p_{\text{GMM}}(c \mid \ell_i)$ (i.e. the probability of the sample belonging to the clean set) is larger than a threshold τ, the sample is considered clean, and otherwise noisy.
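A minimal stand-in for the GMM-based clean/noisy split, using a small hand-rolled 1-D EM loop instead of a library GMM (illustrative only; DivideMix itself uses a standard two-component GMM):

```python
import numpy as np

def clean_probability(losses, n_iter=50, tau=0.5):
    """Fit a two-component 1-D GMM on per-sample losses via EM and return
    (posterior of the low-mean "clean" component, clean/noisy boolean mask)."""
    losses = np.asarray(losses, dtype=float)
    # init: put the components at the extremes, share a wide variance
    mu = np.array([losses.min(), losses.max()])
    var = np.array([losses.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each loss value
        dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -(losses[:, None] - mu) ** 2 / (2 * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nk = r.sum(axis=0)
        mu = (r * losses[:, None]).sum(axis=0) / nk
        var = (r * (losses[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(losses)
    clean = int(np.argmin(mu))  # component with smaller mean = clean labels
    w = r[:, clean]
    return w, w > tau
```

Samples with small losses get a posterior near 1 for the clean component and pass the τ threshold; high-loss samples are routed to the noisy (unlabeled) set.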
  • Compared to MixMatch, DivideMix has an additional co-divide stage for handling noisy samples, as well as the following improvements during training: • Label co-refinement: It linearly combines the ground-truth label yi with the network’s prediction ŷi, which is averaged across multiple augmentations of xi, guided by the clean set probability wi produced by the other network. • Label co-guessing: It averages the predictions from two models for unlabeled data samples.
  • FixMatch. FixMatch (Sohn et al. 2020) generates pseudo labels on unlabeled samples with weak augmentation and only keeps predictions with high confidence. Both weak augmentation and high-confidence filtering help produce high-quality, trustworthy pseudo label targets. FixMatch then learns to predict these pseudo labels given a heavily-augmented sample. Fig. 16. Illustration of how FixMatch works. (Image source: Sohn et al. 2020)
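The per-sample FixMatch unlabeled loss can be sketched as follows; `tau=0.95` matches the commonly cited default, but treat the helper as illustrative:

```python
import numpy as np

def fixmatch_unsup_loss(probs_weak, probs_strong, tau=0.95, eps=1e-12):
    """FixMatch unlabeled loss for one sample.

    If the prediction on the weakly-augmented view is confident, its argmax
    becomes a hard pseudo label and we take cross entropy against the
    prediction on the strongly-augmented view; otherwise the sample is skipped.
    """
    probs_weak = np.asarray(probs_weak)
    if probs_weak.max() < tau:
        return 0.0                      # below the confidence threshold
    pseudo = int(probs_weak.argmax())
    return float(-np.log(np.asarray(probs_strong)[pseudo] + eps))
```

The asymmetry (weak view makes the target, strong view takes the gradient) is what combines pseudo labeling with consistency training.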
  • Combined with powerful pre-training. It is a common paradigm, especially in language tasks, to first pre-train a task-agnostic model on a large unsupervised data corpus via self-supervised learning and then fine-tune it on the downstream task with a small labeled dataset. Research has shown that we can obtain extra gains by combining semi-supervised learning with pre-training.
  • Zoph et al. (2020) studied to what degree self-training can work better than pre-training. Their experiment setup was to use ImageNet for pre-training or self-training to improve COCO. Note that when using ImageNet for self-training, they discard labels and only use ImageNet samples as unlabeled data points. He et al. (2018) has demonstrated that ImageNet classification pre-training does not work well if the downstream task is very different, such as object detection. Fig. 18. The effect of (a) data augmentation (from weak to strong) and (b) the labeled dataset size on the object detection performance. In the legend: Rand Init refers to a model initialized with random weights; ImageNet is initialized with a pre-trained checkpoint at 84.5% top-1 ImageNet accuracy; ImageNet++ is initialized with a checkpoint with a higher 86.9% accuracy. (Image source: Zoph et al. 2020) Their experiments demonstrated a series of interesting findings:
    • The effectiveness of pre-training diminishes with more labeled samples available for the downstream task. Pre-training is helpful in low-data regimes (20%) but neutral or harmful in high-data regimes.
    • Self-training helps in high-data/strong-augmentation regimes, even when pre-training hurts.
    • Self-training can bring additive improvement on top of pre-training, even when using the same data source.
    • Self-supervised pre-training (e.g. via SimCLR) hurts performance in the high-data regime, similar to supervised pre-training.
    • Jointly training supervised and self-supervised objectives helps resolve the mismatch between pre-training and downstream tasks. Pre-training, joint-training and self-training are all additive.
    • Noisy labels or un-targeted labeling (i.e. pre-training labels not aligned with downstream task labels) are worse than targeted pseudo labeling.
    • Self-training is computationally more expensive than fine-tuning a pre-trained model.
  • Chen et al. (2020) proposed a three-step procedure to merge the benefits of self-supervised pretraining, supervised fine-tuning and self-training together:
    1. Pre-train a big model in an unsupervised or self-supervised way.
    2. Supervised fine-tune it on a few labeled examples. It is important to use a big (deep and wide) neural network. Bigger models yield better performance with fewer labeled samples.
    3. Distillation with unlabeled examples by adopting pseudo labels in self-training.
    • It is possible to distill the knowledge from a large model into a small one, because the task-specific use does not require the extra capacity of the learned representation.
    • The distillation loss is formulated as follows, where the teacher network is fixed with weights $\hat\theta_T$: $\mathcal{L}_{\text{distill}} = \underbrace{-(1-\alpha) \sum_{(x_i^l, y_i) \in X} \log p_{\theta_S}(y_i \mid x_i^l)}_{\text{supervised loss}} \underbrace{- \alpha \sum_{u_i \in U} \sum_{y} p_{\hat\theta_T}(y \mid u_i; T) \log p_{\theta_S}(y \mid u_i; T)}_{\text{distillation loss using unlabeled data}}$. Fig. 19. A semi-supervised learning framework leverages the unlabeled data corpus by (Left) task-agnostic unsupervised pretraining and (Right) task-specific self-training and distillation. (Image source: Chen et al. 2020)
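The two-term distillation objective can be sketched as below, with per-example logits passed in as lists (hypothetical helper names; `alpha` and `T` are illustrative):

```python
import numpy as np

def softmax_T(logits, T):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(labeled, unlabeled, alpha=0.7, T=2.0, eps=1e-12):
    """(1 - alpha) * supervised CE on labeled data plus alpha * cross
    entropy between the fixed teacher and the student (temperature T)
    on unlabeled data.

    labeled:   list of (student_logits, onehot_label) pairs.
    unlabeled: list of (teacher_logits, student_logits) pairs.
    """
    sup = sum(-float(np.asarray(y) @ np.log(softmax_T(z, 1.0) + eps))
              for z, y in labeled)
    dis = sum(-float(softmax_T(zt, T) @ np.log(softmax_T(zs, T) + eps))
              for zt, zs in unlabeled)
    return (1 - alpha) * sup + alpha * dis
```

In practice only the student logits receive gradients; the teacher distribution acts as a fixed soft target.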
  • 💡 Quick summary of common themes among recent semi-supervised learning methods, many aiming to reduce confirmation bias:
    • Apply valid and diverse noise to samples via advanced data augmentation methods.
    • When dealing with images, MixUp is an effective augmentation. MixUp could work on language too, resulting in a small incremental improvement (Guo et al. 2019).
    • Set a threshold and discard pseudo labels with low confidence.
    • Set a minimum number of labeled samples per mini-batch.
    • Sharpen the pseudo label distribution to reduce class overlap.