Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le. Description: We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images.
Also related to our work is Data Distillation [52], which ensembled predictions for an image under different transformations to teach a student network. The main use case of knowledge distillation is model compression, where the student model is made smaller than the teacher. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images.
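To make the pseudo-label generation concrete, the teacher's softmax output can be used directly as a soft pseudo label, or collapsed to a one-hot hard label. A minimal NumPy sketch (not the released implementation), where `teacher_logits` is a hypothetical array of shape (batch, num_classes):

```python
import numpy as np

def soft_pseudo_labels(teacher_logits):
    # Soft pseudo label: the teacher's full predicted distribution.
    z = teacher_logits - teacher_logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_pseudo_labels(teacher_logits):
    # Hard pseudo label: a one-hot vector on the teacher's top prediction.
    num_classes = teacher_logits.shape[-1]
    return np.eye(num_classes)[teacher_logits.argmax(axis=-1)]
```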
In knowledge distillation, however, the main goal is to find a small and fast model for deployment; in Noisy Student Training the student is equal to or larger than the teacher and is deliberately noised during training.
As we use soft targets, our work is also related to methods in Knowledge Distillation [7, 3, 26, 16]. Noisy Student Training is based on the self-training framework and is trained with four simple steps: (1) train a teacher model on labeled images, (2) use the teacher to generate pseudo labels on unlabeled images, (3) train a student model on the combination of labeled and pseudo-labeled images while adding noise to the student, and (4) iterate the process by putting back the student as the teacher. Although the images in the unlabeled dataset have labels, we ignore the labels and treat them as unlabeled data. In Noisy Student, we combine training on labeled and pseudo-labeled images into a single step because it simplifies the algorithm and leads to better performance in our preliminary experiments. We apply RandAugment to all EfficientNet baselines, leading to more competitive baselines. In particular, we first perform normal training at a smaller resolution for 350 epochs and then fine-tune at a larger resolution, and we set the survival probability in stochastic depth to 0.8 for the final layer, following the linear decay rule for the other layers. For more information about the large architectures, please refer to Table 7 in Appendix A.1. As can be seen from Table 8, the performance stays similar when we reduce the unlabeled data to 1/16 of the total, which amounts to 8.1M images after duplicating, and only drops when we reduce it further. We used the version from [47], which filtered the validation set of ImageNet. Recent studies have shown that computer vision models lack robustness.
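The four steps can be written as a short schematic loop. In the sketch below, `train`, `predict`, and `select` are hypothetical callables standing in for model training, inference, and the filtering/balancing of pseudo-labeled data; the sketch illustrates the control flow under those assumptions rather than the released training code:

```python
from typing import Callable, Tuple
import numpy as np

def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    return np.eye(num_classes)[labels]

def noisy_student_training(
    labeled: Tuple[np.ndarray, np.ndarray],   # (images, integer labels)
    unlabeled: np.ndarray,                    # images only; any labels are ignored
    num_classes: int,
    train: Callable,     # train(images, target_distributions, noised) -> model
    predict: Callable,   # predict(model, images) -> class probabilities
    select: Callable,    # select(images, probs) -> filtered/balanced (images, soft targets)
    iterations: int = 3,
):
    images, labels = labeled
    targets = to_one_hot(labels, num_classes)
    # Step 1: train a teacher model on the labeled images.
    teacher = train(images, targets, noised=False)
    for _ in range(iterations):
        # Step 2: the teacher generates pseudo labels; it is not noised here.
        pseudo_probs = predict(teacher, unlabeled)
        u_images, u_targets = select(unlabeled, pseudo_probs)
        # Step 3: train an equal-or-larger student, with noise, on the
        # combination of labeled and pseudo-labeled images.
        student = train(
            np.concatenate([images, u_images]),
            np.concatenate([targets, u_targets]),
            noised=True,
        )
        # Step 4: put the student back as the teacher and repeat.
        teacher = student
    return teacher
```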
On robustness test sets, Noisy Student Training improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. These significant gains in robustness on ImageNet-C and ImageNet-P are surprising because our models were not deliberately optimized for robustness (e.g., via data augmentation). The full reference is: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le, Self-Training With Noisy Student Improves ImageNet Classification, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10687-10698. Related semi-supervised works constrain model predictions to be invariant to noise injected into the input, hidden states, or model parameters, and [50] used knowledge distillation on unlabeled data to teach a small student model for speech recognition. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible. For classes where we have too many pseudo-labeled images, we take the images with the highest confidence. We then train a student model which minimizes the combined cross entropy loss on both labeled images and unlabeled images.
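A small sketch of that combined objective, with one-hot targets for labeled images and the teacher's soft distributions for pseudo-labeled images; concatenating the two batches and the implied equal weighting per image are illustrative choices, not settings taken from the paper:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def combined_cross_entropy(onehot_labels, labeled_logits,
                           soft_pseudo_labels, unlabeled_logits):
    # Labeled images carry one-hot targets, pseudo-labeled images carry the
    # teacher's soft distributions; concatenating them gives a single cross
    # entropy over the whole (labeled + pseudo-labeled) batch.
    targets = np.concatenate([onehot_labels, soft_pseudo_labels])
    logits = np.concatenate([labeled_logits, unlabeled_logits])
    return -(targets * log_softmax(logits)).sum(axis=-1).mean()
```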
Noisy Student Training is a semi-supervised learning method which achieves 88.4% top-1 accuracy on ImageNet (state of the art), 2.0% better than the model that requires 3.5B weakly labeled Instagram images, together with surprising gains on robustness and adversarial benchmarks. Here we use unlabeled images to improve the state-of-the-art ImageNet accuracy and show that the accuracy gain has an outsized impact on robustness. A key step is to infer labels on a much larger unlabeled dataset; the gains from this unlabeled data hold up well, probably because it is harder to overfit the large unlabeled dataset. The repository also includes an implementation of Noisy Student Training on SVHN, which boosts the performance of a model trained on the labeled data alone; that experiment uses the standard augmentation instead of RandAugment. Finally, frameworks in semi-supervised learning also include graph-based methods [84, 73, 77, 33], methods that make use of latent variables as target variables [32, 42, 78], and methods based on low-density separation [21, 58, 15], which might provide complementary benefits to our method.
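To make the "infer labels on a much larger unlabeled dataset" step concrete, here is a sketch of confidence filtering and class balancing. `teacher_probs` is a hypothetical array of teacher softmax outputs, and the default threshold and per-class count are illustrative values (the training notes below mention 0.3 and 130K):

```python
import numpy as np

def filter_and_balance(images, teacher_probs, min_confidence=0.3, per_class=130_000):
    """Keep pseudo-labeled images the teacher is confident about, then balance
    classes: over-represented classes keep their highest-confidence images,
    under-represented classes duplicate images until the count is reached."""
    confidence = teacher_probs.max(axis=-1)
    predicted = teacher_probs.argmax(axis=-1)

    keep = confidence >= min_confidence
    images, teacher_probs = images[keep], teacher_probs[keep]
    confidence, predicted = confidence[keep], predicted[keep]

    selected = []
    for c in np.unique(predicted):
        idx = np.where(predicted == c)[0]
        idx = idx[np.argsort(-confidence[idx])]        # most confident first
        if len(idx) >= per_class:
            idx = idx[:per_class]                      # too many: take the top-K
        else:
            extra = np.random.choice(idx, per_class - len(idx))
            idx = np.concatenate([idx, extra])         # too few: duplicate
        selected.append(idx)
    selected = np.concatenate(selected)
    return images[selected], teacher_probs[selected]   # images + soft pseudo labels
```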
The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution). Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. The student is trained on the combination of labeled and pseudo-labeled images, with noise in the form of data augmentation (RandAugment), dropout, and stochastic depth, while the teacher that produces the pseudo labels is not noised. The noise matters: with all noise removed, the accuracy drops from 84.9% to 84.3% in the case with 130M unlabeled images and from 83.9% to 83.2% in the case with 1.3M unlabeled images. In contrast to the baseline, the predictions of the model trained with Noisy Student remain quite stable under small perturbations. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. Prior methods that use extra weakly labeled data did not show significant improvements in terms of robustness on ImageNet-A, C, and P as we did.

The paper appeared at CVPR 2020, with code at https://github.com/google-research/noisystudent. The unlabeled data come from JFT, roughly 300M images: an EfficientNet-B0 trained on ImageNet first predicts labels on JFT, images with confidence below 0.3 are discarded, and 130K images are selected per class, taking the most confident ones when there are more and duplicating when there are fewer, for about 130M images in total. The baseline models are EfficientNets rather than ResNets, and the architectures scale from EfficientNet-B7 up to EfficientNet-L0, L1 and L2: L0 is wider and deeper than B7 but uses a lower resolution, L1 scales up L0, and L2 scales up L1 further, with a training cost around five times that of B7. Training uses a batch size of 2048 (512, 1024, and 2048 give similar results); models at least as large as EfficientNet-B4, including L0, L1, and L2, are trained for 350 epochs and smaller models for 700 epochs. The teacher-student iterations proceed from an EfficientNet-B7 teacher to an EfficientNet-L0 student, then L0 teaches L1, L1 teaches L2, and finally L2 teaches another L2, with the student noised and the teacher not noised at every step. As the authors note, "Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores."
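As a small illustration of the model noise, the helper below computes stochastic depth survival probabilities with the linear decay rule mentioned earlier (0.8 for the final layer) and applies stochastic depth to a residual block at training time. This is a generic sketch of stochastic depth, not the EfficientNet implementation:

```python
import numpy as np

def survival_probabilities(num_layers: int, final_survival: float = 0.8):
    # Linear decay rule: layer l of L survives with probability
    # 1 - (l / L) * (1 - p_L), so early layers are almost always kept
    # and the final layer keeps p_L = 0.8.
    return [1.0 - (l / num_layers) * (1.0 - final_survival)
            for l in range(1, num_layers + 1)]

def residual_block_train(x, transform, survival_prob, rng=np.random):
    # Stochastic depth at training time: randomly skip the residual transform.
    if rng.random() < survival_prob:
        return x + transform(x)
    return x

# Example: survival probabilities for a toy 5-layer network.
print(survival_probabilities(5))   # [0.96, 0.92, 0.88, 0.84, 0.8]
```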