Blind Submission by November • BERT Learns to Teach: Knowledge Distillation with Meta Learning
Meta Review of Paper2286 by Area Chair oioS
This paper proposes a model-agnostic meta-learning approach for distilling BERT for various tasks. The paper is being reviewed a second time, by exactly the same set of reviewers. Two reviewers feel that their concerns have been adequately addressed in the revision, and the third feels that their concerns have not been addressed.
Two reviewers have increased their scores marginally, but on balance there is limited excitement about this paper. It falls in the category of borderline accepts at a top-tier venue like ACL.
A simple approach for knowledge distillation of BERT with decent performance improvements; extensive experiments; the paper is clear and well written.
- improve the clarity on the design of quiz set and its benefits (as mentioned by reviewer GVpZ)
- bring back the useful image from the previous version of the paper
- add qualitative analysis to further understand what exactly is being distilled.
- I do not agree with reviewer C4ei that the computational costs are much worse and the improvement limited. Yes, the improvement is only 0.5 F1 points, but the computational cost is only 20% worse. It would be good to answer the reviewer's question with more such evaluations, and also by computing statistical significance to verify whether 0.5 F1 points is indeed significant on this dataset.
Borderline accept if there is space at top venues like ACL/NAACL, or at slightly lower-tier venues like COLING or CoNLL.
Official Review of Paper2286 by Reviewer GVpZ
This paper studies knowledge distillation and proposes a meta-learning based approach to update the teacher model together with the student. The teacher update is based on the student's performance on a held-out training subset, and the gradients are propagated to the teacher via meta-learning techniques. Experiments on the GLUE benchmark (and CV tasks) demonstrate the performance gain over straightforward KD and other student-aware KD approaches like PKD and SFTN.
The paper is overall well written and easy to follow. The idea to employ an updating teacher model during knowledge distillation is well motivated.
The experiments are conducted more thoroughly than in the last version. More baselines are included, and the experimental settings are specified in more detail.
The design of the quiz set is somewhat unclear. Since the quiz set is sampled before the knowledge distillation process and remains static throughout, one major concern is that the quiz update cannot continuously provide valid supervision, especially in later stages of learning. Since the teacher and the student are both optimized, directly or indirectly, toward better performance on the quiz, the quiz performance may already saturate in early iterations. It is therefore doubtful how the quiz update benefits the knowledge distillation in later iterations. A dynamic quiz set selection might be a more intuitive strategy, but empirical studies are needed. Besides, I wonder whether the quiz set is eventually utilized for student model training, since one-tenth of the training data might bring significant performance gains.
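To make the distinction concrete, the static split under discussion, and the dynamic alternative I am suggesting, can be sketched as follows (variable names and sizes are my own illustration, not the authors' code):

```python
import random

random.seed(0)

# Hypothetical stand-in for the training set; integers represent examples.
train = list(range(1000))
random.shuffle(train)

# Static quiz set as described in the paper: one-tenth sampled once, before
# distillation begins, and held fixed for every subsequent teacher meta-update.
quiz, distill_pool = train[:100], train[100:]

def dynamic_quiz(pool, k=100):
    """Suggested alternative: resample a fresh quiz at each meta-iteration,
    so later teacher updates are not judged on an already-saturated subset."""
    return random.sample(pool, k)
```

Whether the held-out tenth is folded back into student training at the end is exactly the open question raised above.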
I noticed that a very insightful figure from the former version (Figure 2) has been dropped from the current draft. I think the learning dynamics of the student and teacher presented in that figure could shed light on student-aware approaches. The figure shows that at early stages, the teacher `sacrifices' task performance to better transfer its knowledge, which is consistent with the core motivation. However, once the teacher is surpassed by the student, it appears to forget the task knowledge, and its performance degrades steadily. The teacher at this stage probably cannot transfer valid knowledge anymore, since it can no longer solve the task well itself. It might be an interesting topic to make the teacher re-discover its knowledge at later stages rather than remain at poor performance.
Please refer to the Pros and Cons stated above.
Official Review of Paper2286 by Reviewer hV3V
This paper presents a method for knowledge distillation via meta-learning and applies this approach to distilling/compressing the BERT model into a smaller one, while retaining performance on several GLUE tasks. Their approach directly optimizes the teacher network for transfer using meta-learning approaches.
- The paper is generally well-written and clear.
- The approach is reasonable and well-motivated.
- Experimental results are very comprehensive and the authors have compared to many recent baselines on many NLP tasks.
- There are some reasonable ablation studies that show that both components in their model are important.
I previously reviewed an older version of this paper and had 3 main concerns: 1. lack of runtime analysis since meta-learning methods can be slow, 2. issues in experimental design in comparing compression and performance, and 3. lack of qualitative analysis on what happens during distillation.
The authors have largely addressed point 1 with additional experiments and analysis showing that inference can still be fast.
For point 2 the authors justified their choice of compressing BERT from 110M to 66M parameters, and also added comparisons across other original and compressed sizes. I think these additional results are useful and improve the paper. However, it would still be good to compare performance at varying levels of compression (e.g. a performance vs. parameter-count plot across different models) to see all results in one figure.
The additional experiments do somewhat address my previous weakness point 3, but it would still be good to add some qualitative analysis regarding what is being distilled/transferred from BERT to the smaller model, and where exactly the model loses information to account for the drop in performance (e.g. on which category within each dataset is the performance drop largest?).
I have increased my score since my previous review of this paper.
see weaknesses above
the authors could add a discussion regarding this paper https://arxiv.org/pdf/2010.03058.pdf and the potential ethical implications that arise from distilling BERT
Official Review of Paper2286 by Reviewer C4ei
This paper presents a knowledge distillation method based on MAML. The paper's main aim is to address two drawbacks of existing knowledge distillation methods: (1) teacher being unaware of the capacity of the student, and (2) teacher not optimised for distillation.
The paper assumes the teacher network is trainable and proposes a simple technique that, in the MAML framework, applies "pilot updates" in the inner loop to make the teacher and student networks more compatible. Pilot updates are essentially MAML's original inner loop, but applied twice, the first time to update the teacher and the second time to update the student.
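As I read it, the mechanism reduces to the following toy computation: the student takes a virtual ("pilot") step against the current teacher, the pilot student is scored on a quiz target, and the quiz loss is backpropagated through the pilot step to the teacher. Scalar parameters and quadratic losses stand in for the real KD and quiz objectives; all names and values are my own illustration, not the authors' code.

```python
lr_inner, lr_teacher = 0.25, 0.5

def kd_grad_wrt_student(s, t):
    # d/ds of the toy distillation loss (s - t)^2: pulls student toward teacher
    return 2.0 * (s - t)

def pilot_update(s, t):
    # first inner-loop pass: a virtual student step against the current teacher
    return s - lr_inner * kd_grad_wrt_student(s, t)   # s' = s - 2*lr*(s - t)

def teacher_meta_step(s, t, quiz_target):
    # evaluate the pilot student on the quiz loss (s' - y)^2, then propagate
    # that loss back to the teacher through the pilot step via the chain rule
    s_pilot = pilot_update(s, t)
    dquiz_dspilot = 2.0 * (s_pilot - quiz_target)
    dspilot_dt = 2.0 * lr_inner                       # ds'/dt from pilot_update
    return t - lr_teacher * dquiz_dspilot * dspilot_dt

teacher, student, quiz_target = 0.0, 2.0, 0.5
teacher = teacher_meta_step(student, teacher, quiz_target)
print(teacher)  # -0.25: the teacher dips below the quiz target to rein in the
                # overshooting student; the real student update then follows
```

Note that the updated teacher ends up farther from the quiz target than it started, which is consistent with the "teacher sacrifices task performance" dynamics reviewer GVpZ describes.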
Experiments are performed on distilling BERT and evaluated on a number of tasks/datasets in the GLUE benchmark. The proposed method consistently outperforms the compared baselines on almost all tasks.
On the CV side, the proposed method is evaluated on distilling ResNet and VGG. It achieves performance comparable to the SOTA method CRD.
Addresses knowledge distillation, an important technique with wide application in machine learning, natural language processing and computer vision.
The proposed method achieves SOTA performance on a large number of tasks in NLP, in distilling BERT.
The presentation is clear, logical and easy to follow.
A comparative analysis on the tradeoff between model performance and computational costs.
The proposed method is quite simple. Its essence is the "pilot update" mechanism, which basically applies the inner loop of MAML twice, once to update the teacher, and once to update the student.
The achieved performance boost is moderate.
The comparative analysis in Table 2 shows that, with significantly more computational costs than the two compared baselines (PKD and ProKT), the model achieves a modest performance gain of about 0.5 absolute F1 points. This analysis, though very welcome, demonstrates that the proposed method suffers a large computational penalty for a small accuracy/F1 gain.
This paper is a resubmission from ARR August, and it has not been significantly improved. The only major addition is Table 2, which describes the tradeoff between computational cost and model performance. Therefore, the majority of the reviewers' comments from ARR August still hold (I was one of the original reviewers).