(Submitted on 11 Nov 2019) We present Noisy Student Training, a simple self-training method and semi-supervised learning approach that works well even when labeled data is abundant. It achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images (the initial submission reported 87.4%, a 1.0% improvement). Noisy Student Training extends the ideas of self-training and distillation by using an equal-or-larger student model and by adding noise to the student during learning; we use stochastic depth [29], dropout [63], and RandAugment [14]. A teacher model is first trained on labeled images and used to generate pseudo labels on unlabeled images; we then train a student model that minimizes the combined cross-entropy loss on both labeled images and unlabeled images, and we iterate this process by putting back the student as the teacher. Following the idea of compound scaling [69], we scale all dimensions to obtain EfficientNet-L2; for more information about the large architectures, please refer to Table 7 in Appendix A.1. Our main results are shown in Table 1: Noisy Student outperforms the previous state-of-the-art accuracy of 86.4% set by FixRes ResNeXt-101 WSL [44, 71], which requires 3.5 billion Instagram images labeled with tags. The results also confirm that vision models can benefit from Noisy Student even without iterative training, and in the case with 130M unlabeled images, even with the noise function removed, performance still improves from 84.0% (the supervised baseline) to 84.3%. Figure 1(a) shows example images from ImageNet-A and the predictions of our models. [76] also proposed to first train only on unlabeled images and then finetune the model on labeled images as the final stage, and [2] show that self-training is superior to pre-training with ImageNet supervised learning on several computer vision tasks.
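To make the procedure concrete, here is a minimal, self-contained PyTorch sketch of the loop described above. The tiny fully connected models, random tensors, and hyperparameters are placeholders standing in for EfficientNets, ImageNet, and the unlabeled corpus; this shows the structure of the algorithm, not the paper's actual implementation.

```python
# Minimal sketch of the Noisy Student loop: train a teacher, pseudo-label unlabeled data,
# train a larger noised student on both sets, then reuse the student as the next teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width, num_classes=10, noised=False):
    # Dropout stands in here for the full noise recipe (RandAugment, dropout, stochastic depth).
    layers = [nn.Flatten(), nn.Linear(3 * 32 * 32, width), nn.ReLU()]
    if noised:
        layers.append(nn.Dropout(p=0.5))
    layers.append(nn.Linear(width, num_classes))
    return nn.Sequential(*layers)

def train(model, images, target_probs, epochs=5, lr=0.1):
    # Cross entropy against (possibly soft) target distributions.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.sum(-target_probs * F.log_softmax(model(images), dim=-1), dim=-1).mean()
        loss.backward()
        opt.step()
    return model

# Toy stand-ins for the labeled set and the much larger unlabeled set.
labeled_x = torch.randn(256, 3, 32, 32)
labeled_p = F.one_hot(torch.randint(0, 10, (256,)), 10).float()
unlabeled_x = torch.randn(1024, 3, 32, 32)

# Step 1: train a teacher (without noise) on the labeled images.
teacher = train(make_model(width=64), labeled_x, labeled_p)

for step in range(3):  # iterate: the trained student becomes the next teacher
    # Step 2: the un-noised teacher produces soft pseudo labels for the unlabeled images.
    teacher.eval()
    with torch.no_grad():
        pseudo_p = F.softmax(teacher(unlabeled_x), dim=-1)
    # Step 3: train an equal-or-larger, noised student on labeled plus pseudo-labeled data.
    student = make_model(width=64 * (step + 2), noised=True)
    student = train(student, torch.cat([labeled_x, unlabeled_x]),
                    torch.cat([labeled_p, pseudo_p]))
    teacher = student
```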
Also related to our work is Data Distillation [52], which ensembles predictions for an image under different transformations to teach a student network. The best model in our experiments is the result of iterative training of teacher and student, putting the student back as the new teacher to generate new pseudo labels. While removing noise leads to a much lower training loss for labeled images, we observe that, for unlabeled images, removing noise leads to a smaller drop in training loss; the invariance constraint imposed by noise reduces the degrees of freedom in the model. (Paper: Self-training with Noisy Student improves ImageNet classification. Qizhe Xie¹, Minh-Thang Luong¹, Eduard Hovy², Quoc V. Le¹. ¹Google Research, Brain Team; ²Carnegie Mellon University. {qizhex, thangluong, qvl}@google.com, hovy@cmu.edu.) We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task, as commonly done in the literature [35, 66, 23, 69] (see also [55]); on ImageNet-C, the reported top-1 accuracy is simply the average top-1 accuracy over all corruptions and all severity degrees. Finally, note that the pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution).
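A small sketch of the difference, assuming a placeholder `teacher` model and a batch of unlabeled images; in the paper the choice between soft and hard labels is an experimental knob, and this is only an illustration of the two options.

```python
# Sketch of generating soft vs. hard pseudo labels with an un-noised teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(teacher: nn.Module, images: torch.Tensor, soft: bool = True) -> torch.Tensor:
    teacher.eval()                                   # the teacher is not noised at labeling time
    probs = F.softmax(teacher(images), dim=-1)
    if soft:
        return probs                                 # soft: the full predicted distribution
    return F.one_hot(probs.argmax(dim=-1), probs.size(-1)).float()  # hard: one-hot argmax

# Toy usage with a placeholder linear "teacher".
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))
batch = torch.randn(4, 3, 8, 8)
print(pseudo_labels(teacher, batch, soft=True).shape)        # (4, 5): one distribution per image
print(pseudo_labels(teacher, batch, soft=False).sum(dim=-1)) # all ones: one-hot labels
```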
[57] used self-training for domain adaptation; their purpose is different from ours, namely to adapt a teacher model from one domain to another. One might suspect that the gains come from the student overfitting the unlabeled set; we verify that this is not the case when we use 130M unlabeled images, since the training loss shows the model does not overfit the unlabeled set. We find that using a batch size of 512, 1024, or 2048 leads to the same performance. Using Noisy Student (EfficientNet-L2) as the teacher leads to another 0.8% improvement on top of the improved results. Our study shows that using unlabeled data improves accuracy and general robustness, and an important contribution of our work is to show that Noisy Student can potentially help address the lack of robustness in computer vision models. For ImageNet-A, the mapping from its 200 classes to the original ImageNet classes is available online (https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py). In the example predictions, the model without Noisy Student ignores the sea lions in the top-left image and mistakenly recognizes a buoy as a lighthouse, while the model with Noisy Student recognizes the sea lions; the most interesting image is shown on the right of the first row. The architectures for the student and teacher models can be the same or different, but making the student larger than, or at least equal to, the teacher lets the student better learn from a larger dataset. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that the student generalizes better than the teacher; we apply dropout to the final classification layer with a dropout rate of 0.5.
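The three noise sources can be sketched as follows. This is an illustrative assumption of how they might be wired together: torchvision's RandAugment is only an approximation of the paper's augmentation (the magnitude of 27 matches the setting quoted later), and the residual block and dimensions are toy placeholders rather than EfficientNet components.

```python
# Sketch of the three kinds of student noise: input noise (RandAugment) and
# model noise (stochastic depth inside the network, dropout before the classifier).
import torch
import torch.nn as nn
from torchvision import transforms

# Input noise: two random augmentation operations per image.
augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=27),
    transforms.ToTensor(),
])

class ResidualBlockWithStochasticDepth(nn.Module):
    """Toy residual block whose residual branch is randomly dropped during training."""
    def __init__(self, dim, survival_prob=0.8):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(()) > self.survival_prob:
                return x                                        # skip the branch entirely
            return x + self.branch(x) / self.survival_prob      # rescale to keep the expectation
        return x + self.branch(x)

# Model noise: stochastic depth plus a 0.5 dropout on the final classification layer.
student_head = nn.Sequential(
    ResidualBlockWithStochasticDepth(dim=256),
    nn.Dropout(p=0.5),
    nn.Linear(256, 1000),
)
```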
Noisy Student Training was published at CVPR 2020. Self-training was previously used to improve ResNet-50 from 76.4% to 81.2% top-1 accuracy [76], which is still far from the state-of-the-art accuracy; we found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. In the experiments, ImageNet serves as the labeled set and the unlabeled images come from the JFT dataset: the teacher is used to filter the unlabeled data, keeping only images whose predicted confidence is above 0.3 and selecting at most 130K of the most confident images per class. EfficientNets are used as the baseline models because they are more accurate than ResNets at comparable cost, and compound scaling [69] uniformly scales network depth, width, and resolution with a single compound coefficient. EfficientNet-L0 is wider and deeper than EfficientNet-B7 but uses a lower resolution, EfficientNet-L1 scales up L0, and EfficientNet-L2 further scales up L1. Training uses a labeled batch size of 2048; for smaller models, we set the batch size of unlabeled images to be the same as the batch size of labeled images. Models larger than EfficientNet-B4, including EfficientNet-L0, L1, and L2, are trained for 350 epochs, while smaller models are trained for 700 epochs. For iterative training, an EfficientNet-B7 is trained first and used as the teacher for an EfficientNet-L0 student; L0 then teaches L1, and L1 teaches L2. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores. On ImageNet-P, mFR (mean flip rate) is the weighted average of flip probability over the different perturbations, with AlexNet's flip probability as the baseline.
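The mFR definition above can be illustrated with a short, self-contained computation; the perturbation names, prediction sequences, and AlexNet baselines below are made-up placeholders, not real measurements.

```python
# Illustrative mean-flip-rate (mFR) computation in the style of ImageNet-P:
# per-perturbation flip probabilities, normalized by AlexNet's flip probabilities.
from typing import Dict, List

def flip_probability(preds: List[int]) -> float:
    """Fraction of consecutive frames in a perturbation sequence whose top-1 prediction flips."""
    flips = sum(1 for a, b in zip(preds, preds[1:]) if a != b)
    return flips / max(len(preds) - 1, 1)

def mean_flip_rate(model_fp: Dict[str, float], alexnet_fp: Dict[str, float]) -> float:
    """Average the model's per-perturbation flip probability, normalized by AlexNet's."""
    return 100.0 * sum(model_fp[p] / alexnet_fp[p] for p in model_fp) / len(model_fp)

# Hypothetical prediction sequences over perturbed frames, and AlexNet baselines.
model_fp = {
    "gaussian_noise": flip_probability([1, 1, 1, 2, 1, 1]),
    "motion_blur": flip_probability([3, 3, 3, 3, 4, 4]),
}
alexnet_fp = {"gaussian_noise": 0.28, "motion_blur": 0.30}
print(mean_flip_rate(model_fp, alexnet_fp))  # lower is better; 100 would match AlexNet
```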
The architecture specifications of EfficientNet-L0, L1, and L2 are listed in Table 7, and you can also use the colab script noisystudent_svhn.ipynb to try the method on free Colab GPUs. We use soft pseudo labels for our experiments unless otherwise specified. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A [25] top-1 accuracy from 16.6% to 74.2%, ImageNet-C [24] mean corruption error (mCE) from 45.7 to 31.2, and ImageNet-P [24] mean flip rate (mFR) from 27.8 to 16.1. As a comparison, our method only requires 300M unlabeled images, which are perhaps easier to collect than 3.5B weakly labeled Instagram images. Noisy Student also improves adversarial robustness against an FGSM attack, even though the model is not optimized for adversarial robustness.
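For reference, FGSM is the standard fast gradient sign method of Goodfellow et al.; the sketch below is a generic implementation with a placeholder model and epsilon, not the paper's evaluation protocol.

```python
# Generic FGSM attack: one signed-gradient step that increases the loss, clipped to [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_attack(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
                epsilon: float = 2.0 / 255) -> torch.Tensor:
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return (images + epsilon * images.grad.sign()).clamp(0.0, 1.0).detach()

# Toy usage with a placeholder classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 10))
x, y = torch.rand(4, 3, 16, 16), torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())  # perturbation magnitude is bounded by epsilon
```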
On ImageNet-P, our method leads to a mean flip rate (mFR) of 17.8 if we use a resolution of 224x224 (a direct comparison) and 16.1 if we use a resolution of 299x299. (For EfficientNet-L2, we use the model without finetuning at a larger test-time resolution, since a larger resolution results in a discrepancy with the resolution of the data and leads to degraded performance on ImageNet-C and ImageNet-P.) In contrast, changing architectures or training with weakly labeled data gives modest gains in accuracy, from 4.7% to 16.6%. Noisy Student can still improve the accuracy to 1.6%. The baseline model achieves an accuracy of 83.2%; in other words, using Noisy Student makes a much larger impact on accuracy than changing the architecture. Hence, a question that naturally arises is why the student can outperform the teacher with soft pseudo labels; we hypothesize that the improvement can be attributed to SGD, which introduces stochasticity into the training process. Lastly, we trained another EfficientNet-L2 student by using the EfficientNet-L2 model as the teacher. The repository also provides instructions on running prediction on unlabeled data, filtering and balancing the data, and training using the stored predictions.
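As a concrete illustration of the filtering and balancing step just mentioned, here is a minimal sketch: images are kept only when the teacher's confidence clears a threshold, each class keeps at most a fixed number of its highest-confidence images, and small classes are filled up by duplicating images. The data layout and the duplication rule are illustrative assumptions, not the repository's actual implementation; the 0.3 threshold and 130K cap follow the settings described earlier.

```python
# Sketch of filtering and balancing pseudo-labeled data from stored teacher predictions.
import random
from collections import defaultdict

def filter_and_balance(predictions, threshold=0.3, per_class=130_000):
    """predictions: list of (image_id, class_id, confidence) tuples from the un-noised teacher."""
    by_class = defaultdict(list)
    for image_id, class_id, conf in predictions:
        if conf >= threshold:                    # drop low-confidence (likely out-of-domain) images
            by_class[class_id].append((conf, image_id))

    selected = {}
    for class_id, items in by_class.items():
        items.sort(reverse=True)                 # keep the most confident images first
        keep = [img for _, img in items[:per_class]]
        while 0 < len(keep) < per_class:         # balance small classes by duplicating images
            keep.append(random.choice(keep))
        selected[class_id] = keep
    return selected

# Toy usage: one low-confidence image is filtered out, both classes are padded to the cap.
preds = [("img1", 0, 0.9), ("img2", 0, 0.2), ("img3", 1, 0.6)]
print({c: len(v) for c, v in filter_and_balance(preds, per_class=4).items()})  # {0: 4, 1: 4}
```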
We train our model using the self-training framework [59], which has three main steps: 1) train a teacher model on labeled images, 2) use the teacher to generate pseudo labels on unlabeled images, and 3) train a student model on the combination of labeled images and pseudo-labeled images. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images; due to duplications, there are only 81M unique images among the 130M images selected for training. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible. As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher's knowledge; with noise, the student is forced to learn harder from the pseudo labels. For RandAugment, we apply two random operations with the magnitude set to 27. In all previous experiments, the student's capacity is as large as or larger than the capacity of the teacher model; this is an important difference between our work and prior works on the teacher-student framework, whose main goal is model compression. The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student. Our work is based on self-training (e.g., [59, 79, 56]). In our experiments, we observe that soft pseudo labels are usually more stable and lead to faster convergence, especially when the teacher model has low accuracy. We find that Noisy Student is better with an additional trick: data balancing. We start with the 130M unlabeled images and gradually reduce the number of images; the performance drops when we further reduce it. As shown in Table 2, Noisy Student with EfficientNet-L2 achieves 87.4% top-1 accuracy, which is significantly better than the best previously reported accuracy on EfficientNet of 85.0%. The biggest gain is observed on ImageNet-A, where our method improves top-1 accuracy from 16.6% for the previous state of the art to 74.2%; the ImageNet-A test set [25] consists of difficult images that cause significant drops in accuracy for state-of-the-art models. Related work on fixing the train-test resolution discrepancy shows that, for a target test resolution, using a lower train resolution offers better classification at test time, and proposes a simple yet effective strategy to optimize classifier performance when the train and test resolutions differ. Code is available at https://github.com/google-research/noisystudent.
Using self-training with Noisy Student, together with 300M unlabeled images, we improve EfficientNet's [69] ImageNet top-1 accuracy to 87.4%. The total gain of 2.4% comes from two sources: making the model larger (+0.5%) and Noisy Student (+1.9%). In our implementation, labeled images and unlabeled images are concatenated together and we compute the average cross-entropy loss. For each class, we select at most 130K images that have the highest confidence. We also list EfficientNet-B7 as a reference. In earlier self-training work, noise injection was not used and the student model was also small, which makes it more difficult for the student to become better than the teacher; different kinds of noise, however, may have different effects. [50] used knowledge distillation on unlabeled data to teach a small student model for speech recognition. As shown in Tables 3, 4, and 5, when compared with the previous state-of-the-art model ResNeXt-101 WSL [44, 48] trained on 3.5B weakly labeled images, Noisy Student yields substantial gains on robustness datasets. After testing our model's robustness to common corruptions and perturbations, we also study its performance on adversarial perturbations. A number of studies, e.g., [68, 24, 55, 22], leverage unlabeled data for adversarial robustness; the main difference between our work and these works is that they directly optimize adversarial robustness on unlabeled data, whereas we show that self-training with Noisy Student improves robustness greatly even without directly optimizing robustness. The paper is available at https://arxiv.org/abs/1911.04252.
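The combined loss described above can be sketched as a single training step in which the labeled and pseudo-labeled parts of a batch are concatenated and one averaged cross-entropy is computed over both; the model, batch sizes, and in-batch ratio below are illustrative placeholders (as noted earlier, only the smaller models use a 1:1 labeled-to-unlabeled ratio).

```python
# Sketch of one combined training step: hard labels are converted to one-hot so that
# labeled and (soft) pseudo-labeled examples share a single averaged cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(labeled_x, labeled_y, unlabeled_x, pseudo_probs, num_classes=1000):
    targets = torch.cat([F.one_hot(labeled_y, num_classes).float(), pseudo_probs])
    inputs = torch.cat([labeled_x, unlabeled_x])            # one batch containing both parts
    loss = torch.sum(-targets * F.log_softmax(model(inputs), dim=-1), dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step with a 1:3 labeled-to-unlabeled ratio inside the batch.
loss = train_step(
    torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8,)),
    torch.randn(24, 3, 32, 32), F.softmax(torch.randn(24, 1000), dim=-1),
)
print(loss)
```

In the full pipeline the pseudo labels would come from the stored teacher predictions after the filtering and balancing step sketched earlier.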