============================================================================
EMNLP 2021 Reviews for Submission #3143
============================================================================

Title: Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression

Authors: Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley and Furu Wei

============================================================================
META-REVIEW
============================================================================

Comments: This paper argues for evaluating model compression techniques (e.g. knowledge distillation, quantisation) in terms of loyalty, i.e. to what extent does the student model produce predictions similar to the teacher model's? This evaluation protocol goes beyond the standard practice of evaluating the compressed model based on overall task accuracy alone, which may not necessarily reflect how different the student and teacher models' predictions are.

Reviewers appreciate the usefulness of the findings for the community (where knowledge distillation and quantisation are widely used for model compression), and also the extent of the experiments, which are fairly comprehensive given the short paper page limit. Reviewers also find the paper to be clear and well-written.

Overall, I think the findings of the paper can be useful for model compression research going forward, where knowledge distillation and quantisation are becoming increasingly important given the ever-growing size of NLP models. There are some concerns regarding the novelty (KL divergence between teacher & student is already widely used for knowledge distillation training, although not necessarily as an evaluation metric on downstream tasks), and also regarding the connection with KL divergence & JS divergence. However, I do not find either of these issues to be particularly pressing, since short paper submissions welcome focused, narrower contributions, and the paper clearly describes how the loyalty metric relates to JS divergence. For future iterations of the paper, I recommend taking the reviewer feedback into account, particularly by broadening the empirical scope to more tasks & datasets.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper argues that accuracy alone is insufficient to evaluate model compression methods (distillation, pruning, quantization). It suggests we should also evaluate loyalty (are the predictions/prediction probabilities similar to the base model's?) and robustness to adversarial attacks as additional metrics. The paper computes those metrics for 9 compression methods on MNLI. It finds that different methods offer different advantages, for instance that post-training quantization reduces the success of adversarial attacks considerably, though with a bigger drop in accuracy than other compression methods. Finally, the paper also computes those metrics for combinations of compression methods.

The general premise of the paper is convincing: there are possible reasons why accuracy alone might not reflect the effectiveness of compression methods (1. different biases for the compressed models, 2. inability to distinguish between the impact of the compression method and/or additional data/augmentation used, 3. lower robustness). The metrics introduced also make some sense: the loyalty metrics ensure the student has predictions similar to the teacher's (mostly aimed at measuring 1. and 2.) and the adversarial attack metric measures robustness (measures 3.). Anecdotally, I have seen loyalty metrics being used in practice in various production contexts, though they are indeed missing from most of the compression methods this paper studies.

Pros:
- The paper is well-written overall. The wording is clear, and the section progression makes sense.
- The experimental setup is clear and hyperparameter details are present. It is also nice to see stdevs for Table 1.
- The paper yields some interesting (although sometimes "intuitive") findings and formulates useful recommendations. The recommendations are welcome, given it is easy to add more metrics to track (there is often some unmeasured aspect of a model) but hard to make decisions in the presence of many metrics (unless a model is not on the Pareto frontier).

Cons:
- Messaging could be improved a bit. To me there are two main messages: 1/ We have new metrics that are useful to evaluate important but underemphasized practical aspects of compression methods; here's why they are useful. 2/ Using our metrics, here are some recommendations about which compression methods to use. Currently, the abstract/intro is mostly about 1/ and the rest of the paper mostly about 2/. I would try to talk about both aspects throughout.
- The authors comment on the label loyalty metric being important to distinguish improvements that come from extra data/augmentation, but this is not shown in the paper. TinyBERT is the only model that does extra augmentation (at fine-tuning time; it is not 100% clear the authors also do this), but its loyalty metrics are equal to DistilBERT's. It would be nice to see a clearer case.
- There are potentially many different adversarial attacks / robustness issues with models. I.e., PTQ is very good against TextFooler but might be worse against other attacks. This should be stated more clearly, as the general statement that PTQ is more robust feels too strong with the presented evidence.
- Minor: The argument used to justify the use of the probability loyalty metric is interesting (calibration + changing the probability distribution can lead to wrong results in early-exiting pipelines). However, for the latter, it would be nice to see how often it happens in practice (e.g.: label loyalty w/ early exiting - label loyalty). Overall, it would be nice to see stronger evidence that probability loyalty is important to keep track of (knowing that tracking too many metrics makes decisions difficult).
- There is somewhat limited novelty. This is not a dealbreaker; the paper and its empirical results are still useful to the community.

~~~ Author response edit ~~~
Left my grade unchanged. I believe most of my points still stand, both on the pros and cons.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
- This is a sound recommendation to go beyond accuracy in evaluating compression methods. Publishing this paper will incentivize the community to have a healthier approach to compression paper evaluation.
- The tables regarding the strengths of the different methods have useful findings per se. Same goes for the trick combination ones.
Overall, this is not transformative but likely to be a useful resource to the community, so it should be accepted.
---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
The limited novelty of the paper could be an argument, though I do not think it is a dealbreaker for me. The paper would still be useful to the community.
---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------
N/A
---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------
N/A
---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
fintuned -> finetuned
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 5
Ethical Concerns: No
Overall Recommendation - Short Paper: 3.5

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper focuses on an important and meaningful area of model compression: how to evaluate the trained student (small) model. In addition to label accuracy, which has been widely used, this paper introduces other metrics to evaluate the student model, including label loyalty, probability loyalty, and robustness to adversarial attacks. This paper also conducts experiments on the MNLI dataset to show the results of seven existing model compression techniques. Based on the experiments, this paper empirically claims that (i) knowledge-distillation-based methods usually have higher loyalty while quantization-based methods usually have better robustness; (ii) for speed-sensitive and robustness-sensitive applications, the paper recommends a two-step compression strategy: first conduct pruning or module replacing with a KD loss, then apply post-training quantization.

Strengths:
1. This paper focuses on an important and meaningful area of model compression, which lacks publications in the community. Meanwhile, this paper is well-written and easy to read.
2. This paper conducts detailed experiments (for a short paper) to show how to evaluate different widely used model compression techniques.
3. It seems easy to follow this paper because most of the implementation details are given.
4. In addition to the proposed metrics, this paper also provides some useful recommendations for model compression procedures.

Weaknesses:
1. In my opinion, it may be better if this paper could explore deeper findings on model compression. In other words, it may be better if this paper were expanded into a long paper to show more new things to the community. For example, this paper proposes interesting recommendations for model compression procedures, but what happens if we follow these recommendations?
This may not be a major weakness for a short paper, but I am looking forward to learning more in this direction.
2. It would be better to show more results on other datasets to demonstrate the generalization of loyalty and robustness.
3. Personally, I think loyalty is not a perfect metric to evaluate the student model. This metric is highly dependent on the teacher model. Empirically, some student models may predict a correct label on downstream tasks where the teacher model cannot. This situation hurts the loyalty score, while I think it should be encouraged, because it means that the student model can not only learn from the teacher model but may also have the ability to rectify the teacher model.

After Response: Although the authors' response helps me better understand the whole paper, I think the score is accurate and therefore I will not change my score.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
Same as the Strengths.
---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
Same as the Weaknesses.
---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------
1. It is undoubtedly true that accuracy alone is not enough. But why do you select loyalty to evaluate the student models, and why not other metrics?
2. In the experiments, BERT-Base is used as the teacher model. What happens if other PLMs are used as the teacher model? Do the conclusions still hold?
3. As shown in Table 2, module replacing (BERT-of-Theseus) has poor performance compared with other techniques; are there any further possible explanations?
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 5
Ethical Concerns: No
Overall Recommendation - Short Paper: 3.5

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper is motivated by the drawbacks of a widely used evaluation metric for compressed BERT models. Preserved accuracy is not a good measure of how the compressed model behaves. The author(s) propose(s) two metrics, label loyalty and probability loyalty, to measure how closely the compressed model mimics the original model. Experiments on BERT compression with different methods show that the combination of knowledge distillation loss and post-training quantization yields better results.

Strengths:
The motivation is clear. Accuracy cannot be directly optimized, so we usually use other objectives as a surrogate.
The proposed strategy achieves better results on BERT compression.

Weaknesses:
Actually, the similarity of two distributions is widely used as the evaluation metric instead of accuracy, so the novelty is limited.
Despite the numerical improvements (higher accuracy or similarity), the proposed method does not solve the aforementioned issues; e.g., the correlation between loyalty and robustness is quite weak (Table 1).
Since this paper lacks a comparison between the proposed metrics and the widely used KL divergence or JS divergence, the conclusion is far from convincing.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
1. The motivation is clear. Accuracy cannot be directly optimized, so we usually use other objectives as a surrogate.
2. The proposed strategy achieves better results on BERT compression.
---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
1. Actually, the similarity of two distributions is widely used as the evaluation metric instead of accuracy, so the novelty is limited.
2. Despite the numerical improvements (higher accuracy or similarity), the proposed method does not solve the aforementioned issues; e.g., the correlation between loyalty and robustness is quite weak (Table 1).
3. Since this paper lacks a comparison between the proposed metrics and the widely used KL divergence or JS divergence, the conclusion is far from convincing.
---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------
Did you compare loyalty with the original KL divergence, JS divergence, or other metrics beyond accuracy? It would be better to add more columns to your tables.
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 3
Ethical Concerns: No
Overall Recommendation - Short Paper: 2.5
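
For concreteness, below is a minimal Python sketch of the quantities discussed above: label loyalty, probability loyalty, and the KL divergence Reviewer 3 asks to compare against. It assumes probability loyalty is derived from the Jensen-Shannon divergence, as the meta-review indicates, instantiated here as 1 - sqrt(JS); the exact formulation in the paper may differ, and all function names and the toy data are illustrative only.

# Sketch: teacher/student agreement metrics for a compressed model.
# Assumption (not taken verbatim from the paper): probability loyalty is
# 1 - sqrt(JS divergence), averaged over examples.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr


def label_loyalty(teacher_probs, student_probs):
    # Fraction of examples where the student predicts the same label as the teacher.
    return float(np.mean(teacher_probs.argmax(-1) == student_probs.argmax(-1)))


def probability_loyalty(teacher_probs, student_probs):
    # scipy's jensenshannon returns the JS *distance*, i.e. sqrt of the divergence.
    js_dist = np.array([jensenshannon(t, s, base=2)
                        for t, s in zip(teacher_probs, student_probs)])
    return float(np.mean(1.0 - js_dist))


def mean_kl(teacher_probs, student_probs, eps=1e-12):
    # Average KL(teacher || student), the quantity Reviewer 3 asks about.
    return float(np.mean(np.sum(rel_entr(teacher_probs + eps, student_probs + eps), axis=-1)))


# Toy usage on a 3-class task (e.g. MNLI): two examples, teacher vs. student distributions.
teacher = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
student = np.array([[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]])
print(label_loyalty(teacher, student),
      probability_loyalty(teacher, student),
      mean_kl(teacher, student))

Label loyalty only rewards matching hard predictions, whereas probability loyalty (like KL/JS divergence) is sensitive to the full output distribution, which is what distinguishes both from reporting task accuracy alone.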