BERT Learns to Teach: Knowledge Distillation with Meta Learning

Anonymous

BERT Learns to Teach: Knowledge Distillation with Meta Learning

Blind Submission by November • BERT Learns to Teach: Knowledge Distillation with Meta Learning

Anonymous

16 Nov 2021 (modified: 14 Jan 2022)ACL ARR 2021 November Blind SubmissionReaders: Everyone

Abstract: We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., \textit{learning to teach}) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of different student capacity and hyperparameters, facilitating the use of KD on different tasks and models.

Revealed to Wangchunshu Zhou, Canwen Xu, Julian McAuley

16 Nov 2021 (modified: 16 Nov 2021)ACL ARR 2021 November Submission

Authors:Wangchunshu Zhou, Canwen Xu, Julian McAuley

TL;DR: We propose MetaDistil, a knowledge distillation algorithm that optimize the teacher model for knowledge transferring and applied it to BERT compression, resulting in new state-of-the-art of task-specific BERT compression.

Previous URL: /forum?id=_giVW90_UAV

Previous PDF: pdf

Response PDF: pdf

Reviewer/Editor Reassignment Request: RtpL, 9fvY, oTdd

Reviewer/Editor Reassignment Justification:

Preprint: yes

Existing Preprints: https://arxiv.org/abs/2106.04570

Preferred Venue: ACL 2022

Consent: yes

Consent To Review: yes

[–][+]

Supplementary Materials by Program Chairs

ACL ARR 2021 November Program Chairs

14 Jan 2022ACL ARR 2021 November Paper2286 Supplementary MaterialsReaders: Program Chairs, Paper2286 Reviewers, Paper2286 Authors, Paper2286 Area Chairs

Previous URL: /forum?id=_giVW90_UAV

Previous PDF: pdf

Response PDF: pdf

Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, this means there haven't been submitted any.

[–][+]

Meta Review of Paper2286 by Area Chair oioS

ACL ARR 2021 November Paper2286 Area Chair oioS

09 Jan 2022ACL ARR 2021 November Paper2286 Meta ReviewReaders: Paper2286 Senior Area Chairs, Paper2286 Area Chairs, Paper2286 Authors, Paper2286 Reviewers, Program Chairs

Metareview:

This paper is about the model-agnostic meta learning approach for distilling BERT for various tasks. The paper is being reviewed a second time, by exactly the same set of reviewers. Two reviewers feel that their concerns have been adequately addressed in the revision, and the third feels that their concerns have not been addressed.

Two reviewers have increased their scores marginally, but on balance there is limited excitement on this paper. This seems to be in the category of borderline accepts in a top tier venue like ACL.

Summary Of Reasons To Publish:

a simple approach for knowledge distillation in BERT with decent improvements in performance. extensive experiments. the paper is clear and well written.

Summary Of Suggested Revisions:

improve the clarity on the design of quiz set and its benefits (as mentioned by reviewer GVpZ)
bring back the useful image from the previous version of the paper
add qualitative analysis to further understand what exactly is being distilled.
I do not agree with reviewer C4ei that computational costs are much worse and improvement limited. Yes, the improvement is only 0.5 F1 points, but computational cost is only 20% worse. It will be good to answer the reviewer's question by more such evaluations, and also by computing stat significance to verify whether 0.5 F1 pts are indeed significant on this dataset.

Overall Assessment: 4 = There are minor points that may be revised

Suggested Venues:

borderline accept if there is space at top venues like ACL/NAACL. Or other slightly lower tier venues like COLING or CONLL.

Ethical Concernes:

none identified.

[–][+]

Official Review of Paper2286 by Reviewer GVpZ

ACL ARR 2021 November Paper2286 Reviewer GVpZ

28 Dec 2021ACL ARR 2021 November Paper2286 Official ReviewReaders: Program Chairs, Paper2286 Senior Area Chairs, Paper2286 Area Chairs, Paper2286 Reviewers, Paper2286 Authors

Paper Summary:

This paper studies knowledge distillation and proposes a meta-learning based approach to update the teacher model together with the student. The teacher update is based on the student performance on a hold-out training subset, and the gradients are propagated to the teacher via meta-learning techniques. Experiments on the GLUE benchmark (and CV tasks) demonstrate the performance gain over straight-forward KD and other student-aware KD approaches like PKD and SFTN.

Summary Of Strengths:

The paper is overall well written and easy to follow. The idea to employ an updating teacher model during knowledge distillation is well motivated.
The experiments are conducted more thoroughly compared to the last version. More experiments baselines are included, and more experiment settings are specified.

Summary Of Weaknesses:

The design of the quiz set is somewhat unclear. Since the quiz set is sampled before the knowledge distillation process and remains static throughout the process, one major concern is that the quiz update cannot continuously provide valid supervision, especially in later stages of the learning. Since the teacher and the student are both optimized towards better performing the quiz, either directly or indirectly, the quiz performance might be competitive enough during early iterations. Thus, how the quiz update benefits the knowledge distillation in later iterations is doubtful. A dynamic quiz set selection might be a more intuitive strategy, but empirical studies are needed. Besides, I wonder whether the quiz set is in the end utilized for student model training since one-tenth of the training data might bring significant performance gains.
I noticed that a very insightful picture in the former version (Figure 2) is overlooked in the current draft. I think the presented learning dynamics of the student and teacher in that figure could shed light on student-aware approaches. The picture shows that at early stages, the teacher `sacrifices' task performances to better transfer the knowledge, which is consistent with the core motivation. However, when the teacher is surpassed by the student, it seems to forget the task knowledge already and the performance worsens constantly. The teacher at this stage probably could not transfer valid knowledge anymore, since itself cannot well solve the task. It might be an interesting topic to make the teacher re-discover its knowledge at later stages instead of keeping poor performances.

Comments, Suggestions And Typos:

Please refer to the Pros and Cons stated above.

Overall Assessment: 3 = Good: This paper is of interest to the *ACL audience and could be published, but might not be appropriate for a top-tier publication venue. It would likely be a strong paper in a suitable workshop.

Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.

Best Paper: No

Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.

Datasets: 1 = No usable datasets submitted.

Software: 1 = No usable software released.

Author Identity Guess: 1 = I do not have even an educated guess about author identity.

[–][+]

Official Review of Paper2286 by Reviewer hV3V

ACL ARR 2021 November Paper2286 Reviewer hV3V

27 Dec 2021 (modified: 27 Dec 2021)ACL ARR 2021 November Paper2286 Official ReviewReaders: Program Chairs, Paper2286 Senior Area Chairs, Paper2286 Area Chairs, Paper2286 Reviewers, Paper2286 Authors

Paper Summary:

This paper presents a method for knowledge distillation via meta-learning and applies this approach to distilling/compressing the BERT model into a smaller one, while retaining performance on several GLUE tasks. Their approach directly optimizes the teacher network for transfer using meta-learning approaches.

Summary Of Strengths:

The paper is generally well-written and clear.
The approach is reasonable and well-motivated.
Experimental results are very comprehensive and the authors have compared to many recent baselines on many NLP tasks.
There are some reasonable ablation studies that show that both components in their model are important.

Summary Of Weaknesses:

I previously reviewed an older version of this paper and had 3 main concerns: 1. lack of runtime analysis since meta-learning methods can be slow, 2. issues in experimental design in comparing compression and performance, and 3. lack of qualitative analysis on what happens during distillation.

The authors have largely addressed point 1 with additional experiments and analysis where inference can still be fast.

For point 2 the authors justified their choice of compressing BERT from 110 to 66 M params, and also added additional comparisons across other original and compressed sizes. I think these additional results are useful and improve the paper. However, it would still be good to compare performance at varying levels of compression (e.g. performance - parameter plot across different models would be good) to see all results in one figure.

The additional experiments do somewhat address my previous weakness point 3, but it would still be good to add some qualitative analysis regarding what is being distilled/transferred from BERT to a smaller model, and where exactly the model loses information to account for the drop in performance (e.g. which category within each of the datasets is the performance drop biggest on?)

I have increased my score since my previous review of this paper.

Comments, Suggestions And Typos:

see weaknesses above

Overall Assessment: 3.5

Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.

Best Paper: No

Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.

Datasets: 1 = No usable datasets submitted.

Software: 1 = No usable software released.

Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Ethical Concerns:

the authors could add a discussion regarding this paper https://arxiv.org/pdf/2010.03058.pdf and the potential ethical implications that arise from distilling BERT

[–][+]

Official Review of Paper2286 by Reviewer C4ei

ACL ARR 2021 November Paper2286 Reviewer C4ei

14 Dec 2021ACL ARR 2021 November Paper2286 Official ReviewReaders: Program Chairs, Paper2286 Senior Area Chairs, Paper2286 Area Chairs, Paper2286 Reviewers, Paper2286 Authors

Paper Summary:

This paper presents a knowledge distillation method based on MAML. The paper's main aim is to address two drawbacks of existing knowledge distillation methods: (1) teacher being unaware of the capacity of the student, and (2) teacher not optimised for distillation.

The paper assumes the teacher network is trainable and proposes a simple technique that, in the MAML framework, applies "pilot updates" in the inner loop to make the teacher and student networks more compatible. Pilot updates are essentially MAML's original inner loop, but applied twice, the first time to update the teacher and the second time to update the student.

Experiments are performed on distilling BERT and evaluated on a number of tasks/datasets in the GLUE benchmark. The proposed method consistently outperforms the compared baselines on almost all tasks.

On the CV side, the proposed method is evaluated on distilling ResNet and VGG. The method achieves comparable performance with SOTA model CRD.

Summary Of Strengths:

Addresses knowledge distillation, an important technique with wide application in machine learning, natural language processing and computer vision.
The proposed method achieves SOTA performance on a large number of tasks in NLP, in distilling BERT.
The presentation is clear, logical and easy to follow.
A comparative analysis on the tradeoff between model performance and computational costs.

Summary Of Weaknesses:

The proposed method is quite simple. Its essence is the "pilot update" mechanism, which basically applies the inner loop of MAML twice, once to update the teacher, and once to update the student.
The achieved performance boost is moderate.
The comparative analysis in Table 2 shows that, with significantly more computational costs than the two compared baselines (PKD and ProKT), the model achieves a modest performance gain of about 0.5 absolute F1 points. This analysis, though very welcome, demonstrates that the proposed method suffers a large computational penalty for a small accuracy/F1 gain.

Comments, Suggestions And Typos:

This paper is a resubmission from ARR August, and it has not been significantly improved. The only major addition is Table 2 that describes the tradeoff between computational cost and model performance. Therefore, the majority of reviewers' comments from ARR August still holds (I was one of the original reviewers).

Overall Assessment: 3 = Good: This paper is of interest to the *ACL audience and could be published, but might not be appropriate for a top-tier publication venue. It would likely be a strong paper in a suitable workshop.

Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.

Best Paper: No

Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.

Datasets: 1 = No usable datasets submitted.

Software: 1 = No usable software released.

Author Identity Guess: 3 = From the contents of the submission itself, I know/can guess at least one author's name.