Anonymous
16 Nov 2021 (modified: 14 Jan 2022) · ACL ARR 2021 November Blind Submission · Readers: November, Senior Area Chairs, Area Chairs, Reviewers, Paper714 Authors
Abstract: In this paper, we propose SkipBERT to accelerate BERT inference by skipping the computation of shallow layers. To achieve this, our approach encodes small text chunks into independent representations, which are then materialized to approximate the shallow representation of BERT. Since using such approximations is inexpensive compared with transformer calculations, we leverage them to replace the shallow layers of BERT and skip their runtime overhead. With off-the-shelf early exit mechanisms, we also skip redundant computation in the top few layers to further improve inference efficiency. Results on GLUE show that our approach can reduce latency by 65% without sacrificing performance. Using only two layers of transformer calculations, we can still maintain 95% of BERT's accuracy.
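To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical sketch of the inference flow: shallow-layer outputs are approximated by looking up precomputed tri-gram representations, and only the remaining upper layers are actually computed. The table layout, padding scheme, and module interfaces are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a SkipBERT-style inference flow (not the authors' code).
import torch

def skipbert_forward(token_ids, plot_table, upper_layers, classifier):
    """token_ids: list[int]; plot_table maps tri-gram id tuples to cached
    hidden states approximating the skipped shallow layers."""
    padded = [0] + token_ids + [0]   # pad so every token sits at the centre of a tri-gram
    hidden = torch.stack([
        plot_table[(padded[i - 1], padded[i], padded[i + 1])]
        for i in range(1, len(padded) - 1)
    ]).unsqueeze(0)                  # (1, seq_len, hidden_size): "shallow" output via lookup

    for layer in upper_layers:       # only the upper (global-attention) layers are computed
        hidden = layer(hidden)
    return classifier(hidden[:, 0])  # predict from the first-position representation
```

Misses in the table would need a fallback (the reviews below mention replacing rare tri-grams with bi-grams); that fallback and any early-exit logic are omitted from this sketch.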
Revealed to Jue WANG, Ke Chen, Gang Chen, Lidan Shou, Julian McAuley

14 Nov 2021 (modified: 15 Nov 2021) · ACL ARR 2021 November Submission
TL;DR: We propose SkipBERT to accelerate BERT inference by skipping the computation of shallow layers.
Previous URL: /forum?id=YYqKz8dBpAi
Previous PDF:  pdf
Response PDF:  pdf
Software:  zip
Preprint: no
Preferred Venue: ACL 2022
Consent: yes
Consent To Review: yes


Supplementary Materials by Program Chairs

ACL ARR 2021 November Program Chairs
14 Jan 2022 · ACL ARR 2021 November Paper714 Supplementary Materials · Readers: Program Chairs, Paper714 Reviewers, Paper714 Authors, Paper714 Area Chairs
Software:  zip
Previous URL: /forum?id=YYqKz8dBpAi
Previous PDF:  pdf
Response PDF:  pdf
Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, none have been submitted.

Meta Review of Paper714 by Area Chair Xd3p

ACL ARR 2021 November Paper714 Area Chair Xd3p
07 Jan 2022 · ACL ARR 2021 November Paper714 Meta Review · Readers: Paper714 Senior Area Chairs, Paper714 Area Chairs, Paper714 Authors, Paper714 Reviewers, Program Chairs
Metareview:

The submission improves the efficiency of a BERT model by learning context-independent representations for chunks of text, and caching these representations. The model is trained by distilling from BERT. When combined with an early exit mechanism, the approach is shown to dramatically improve inference speed without losing performance.
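Since the metareview notes that the model is trained by distilling from BERT, the following is a hedged sketch of a generic two-part distillation objective (hidden-state matching plus softened-logit matching). The layer mapping, temperature, and loss weighting are assumptions for illustration and need not match the paper's two-stage procedure.

```python
# Generic distillation objective sketch; hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, teacher_logits,
                      temperature=2.0, alpha=0.5):
    """Match the student's hidden states and softened predictions to the teacher's."""
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    logit_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                          soft_targets, reduction="batchmean") * temperature ** 2
    return alpha * hidden_loss + (1 - alpha) * logit_loss
```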

Summary Of Reasons To Publish:

The paper tackles a practically important problem with an interesting and novel method.

Empirical results are strong, with a 65% speedup at no cost in accuracy. Extensive ablations and analysis are included.

The reviewers agree that the paper is well written.

The paper has been significantly improved since an earlier submission.

Summary Of Suggested Revisions:

Reviewers make several suggestions for additional tasks (particularly more challenging ones), architectures (such as GPT or BART), and analysis that could be added to further improve the paper. However, the paper is currently ready for publication.

Overall Assessment: 5 = The paper is largely complete and there are no clear points of revision

Official Review of Paper714 by Reviewer hEfZ

ACL ARR 2021 November Paper714 Reviewer hEfZ
30 Dec 2021 · ACL ARR 2021 November Paper714 Official Review · Readers: Program Chairs, Paper714 Senior Area Chairs, Paper714 Area Chairs, Paper714 Reviewers, Paper714 Authors
Paper Summary:

SkipBERT: Efficient Inference with Shallow Layer Skipping

The paper presents an idea for efficient inference in the lower layers: it uses precomputed and stored representations of text chunks held in a precomputed lookup table (PLOT). The paper also incorporates early exit mechanisms as an enhancement to skip redundant computation in the later layers, and reports extensive evaluations on GLUE tasks.

Summary Of Strengths:

Extensive experiments on the GLUE benchmark against DistilBERT, BERT-PKD, and TinyBERT, where SkipBERT accelerates inference by up to 65% without compromising the GLUE score, or by 82% while retaining 95% of the accuracy. The evaluation also discusses the accuracy-versus-latency trade-offs in detail.

Summary Of Weaknesses:

The system can become brittle and may inherit flaws similar to those of n-gram-based language models. The accuracy comparison with the 6- and 4-layer TinyBERT models also seems uneven, since the current SkipBERT model has almost twice the number of parameters.

Comments, Suggestions And Typos:

Overall, the paper presents thorough analysis and experiments using PLOT, and an interesting direction for improving latency with some sacrifice in accuracy. It would be good to see how the approach extends to GPT- or BART-style architectures, although that is beyond the scope of this paper. Additionally, it would be useful to see the impact on zero-shot settings and on more complex tasks such as NER and summarization, from both a latency and an accuracy standpoint.

Overall Assessment: 4 = Strong: This paper is of significant interest (for broad or narrow sub-communities), and warrants acceptance in a top-tier *ACL venue if space allows.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper714 by Reviewer vC4P

ACL ARR 2021 November Paper714 Reviewer vC4P
30 Dec 2021 · ACL ARR 2021 November Paper714 Official Review · Readers: Program Chairs, Paper714 Senior Area Chairs, Paper714 Area Chairs, Paper714 Reviewers, Paper714 Authors
Paper Summary:

This paper proposes an interesting approach for accelerating BERT inference. Specifically, the method precomputes contextualized token representations of the shallow BERT layers. The representations are stored in a lookup table, and as a result the model can simply obtain the outputs of the shallow layers via n-gram lookup. To make this possible, the shallow layers are restricted to take short chunks (n-grams) of local context as input for each token. This simplified BERT model (SkipBERT) is distilled from standard BERT using a two-stage distillation process. The proposed method achieves strong results on the GLUE benchmark, outperforming previous BERT distillation works.
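As a rough illustration of the precomputation this summary describes, the sketch below enumerates frequent tri-grams from a corpus, runs a stand-in local encoder over each chunk, and caches the centre token's hidden state. The `local_encoder` interface, the frequency cutoff, and caching only the centre position are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical offline construction of a precomputed lookup table (PLOT).
import torch
from collections import Counter

def build_plot(corpus_token_ids, local_encoder, top_k=1_000_000):
    """corpus_token_ids: iterable of token-id lists; returns {tri-gram: hidden state}."""
    counts = Counter()                                  # count tri-gram frequencies so only
    for ids in corpus_token_ids:                        # the most common chunks are cached
        padded = [0] + ids + [0]
        counts.update((padded[i - 1], padded[i], padded[i + 1])
                      for i in range(1, len(padded) - 1))

    table = {}
    with torch.no_grad():
        for trigram, _ in counts.most_common(top_k):
            chunk = torch.tensor([list(trigram)])       # (1, 3) chunk of token ids
            hidden = local_encoder(chunk)               # (1, 3, hidden_size)
            table[trigram] = hidden[0, 1]               # cache the centre token's state
    return table
```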

Summary Of Strengths:
  • Augmenting neural networks with external memory such as lookup tables is an interesting and important direction. This paper presents a novel idea for building an efficient student model via representation lookup, and shows promising results on how it can be combined with distillation (and perhaps other compression techniques such as pruning) to achieve a stronger accuracy-latency trade-off.

  • The paper is very well written. The speed and memory analyses and the ablation studies are enriching.

  • This resubmission version has addressed my questions (in Section 4.7.1).

This is a strong paper and I'd like to see it accepted.

Summary Of Weaknesses:

N/A

Comments, Suggestions And Typos:

N/A

Overall Assessment: 4 = Strong: This paper is of significant interest (for broad or narrow sub-communities), and warrants acceptance in a top-tier *ACL venue if space allows.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper714 by Reviewer BCvc

ACL ARR 2021 November Paper714 Reviewer BCvc
26 Dec 2021 · ACL ARR 2021 November Paper714 Official Review · Readers: Program Chairs, Paper714 Senior Area Chairs, Paper714 Area Chairs, Paper714 Reviewers, Paper714 Authors
Paper Summary:

This paper presents an efficient variant of BERT, called SkipBERT, that skips the lower layers by precomputing representations of small text chunks (e.g., tri-grams). It also uses early exits along with distillation to further improve inference efficiency. The evaluation shows a latency reduction of more than half while maintaining task performance (e.g., GLUE scores).

Summary Of Strengths:
  • The idea of precomputing n-grams is simple and can potentially benefit many applications. The authors show SkipBERT's effectiveness on many tasks.

  • The main evaluation results are comprehensive and show strong improvements over existing baselines. The ablation study is exhaustive and explains the SkipBERT design decisions well.

Summary Of Weaknesses:
  • SkipBERT training requires large-scale pretraining from scratch. This cost is even larger considering that the distillation process also requires a pre-trained BERT. Is it possible to make SkipBERT work without pretraining (e.g., by using a pre-trained BERT to help SkipBERT training)? Or how does pretraining help performance?

  • The storage costs are unclear. Table 3 only presents how the ratio of tri-grams replaced by bi-grams affects the space cost, not the storage costs in general (see the back-of-envelope sketch after this list).

  • Inference-efficiency results for batching on GPUs are missing. The paper describes batch size 1 as a common scenario, which is true for end users, but for servers or large-scale deployments it would be better to also compare GPU batching speedups.
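To ground the storage concern above, here is an illustrative back-of-envelope estimate. The hidden size, numeric precision, and table sizes are assumed values for discussion, not figures reported in the paper.

```python
# Back-of-envelope PLOT storage estimate; all parameters below are assumptions.
def plot_storage_gib(num_entries, tokens_per_entry=3, hidden_size=768, bytes_per_value=2):
    """Storage in GiB if each cached chunk keeps one fp16 vector per token."""
    total_bytes = num_entries * tokens_per_entry * hidden_size * bytes_per_value
    return total_bytes / 2**30

for n in (1_000_000, 100_000_000):
    print(f"{n:>11,d} tri-grams -> {plot_storage_gib(n):.1f} GiB")
# roughly 4.3 GiB for 1M tri-grams and 429 GiB for 100M under these assumptions
```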

Comments, Suggestions And Typos:
  • Does SkipBERT build tri-gram hidden-state lookup tables for each dataset and each task? Would a single table built on Wikipedia (or the training corpus) work? If SkipBERT is dataset-specific, this limits the application scenarios and adds storage costs. Adding a few sentences discussing this point would be useful.

  • The distillation in SkipBERT is a necessary component; this could be made clearer if discussed upfront (e.g., in the introduction). Without distillation, SkipBERT would lose much of its performance.

  • Minor suggestion: adding early exits is orthogonal to this work; it is not a contribution but rather a benefit that could be shown in the evaluation section (instead of being highlighted in the abstract and introduction).

  • Related-work comments: it is unclear how the related work in the model-compression section relates to SkipBERT. The input-adaptive inference section essentially discusses only early exiting, without covering other input-adaptive works such as "Adaptive Input Representations for Neural Language Modeling" and "Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search".

Overall Assessment: 3.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper714 by Reviewer fRBS

ACL ARR 2021 November Paper714 Reviewer fRBS
23 Dec 2021 · ACL ARR 2021 November Paper714 Official Review · Readers: Program Chairs, Paper714 Senior Area Chairs, Paper714 Area Chairs, Paper714 Reviewers, Paper714 Authors
Paper Summary:

This paper proposes a memoization-based technique to improve the inference runtime of BERT for downstream applications. It replaces the lower layers of BERT with local transformer layers that operate on n-grams of text rather than the complete sentence. These lower layers build intermediate representations, which are then fed to normal transformer layers that attend to all tokens instead of n-grams. This yields two advantages:

  1. The outputs for n-grams can be precomputed, trading extra memory for faster compute: the cached n-gram representations can be retrieved with a simple lookup.
  2. Methods like early exit combine well with this approach. Early exit reduces computation by reducing the number of layers executed; since normal transformers start from non-contextualized word embeddings, a few layers must run before early exit becomes attractive, as these layers provide the contextualized representations needed for the task. By supplying contextualized word embeddings from a lookup table, SkipBERT can make early exit attractive after far fewer layers than a standard architecture (a sketch of such an exit criterion follows this list).
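A hedged sketch of the kind of early-exit criterion referred to in point 2, assuming one internal classifier per upper layer and an entropy threshold; both the criterion and the threshold value are illustrative, not necessarily the paper's exact configuration.

```python
# Illustrative entropy-based early exit over the upper (global-attention) layers.
import torch

def early_exit_forward(hidden, layers, exit_classifiers, entropy_threshold=0.3):
    """hidden: (1, seq_len, hidden_size), e.g. obtained from the lookup table.
    Runs layers in order and stops once the per-layer classifier is confident."""
    probs = None
    for layer, clf in zip(layers, exit_classifiers):
        hidden = layer(hidden)
        probs = torch.softmax(clf(hidden[:, 0]), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:   # confident enough: skip remaining layers
            break
    return probs, hidden
```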
Summary Of Strengths:
  1. Detailed breakdown of the systems aspects of this method: memory used, latency of each part of the model, and memory/accuracy trade-offs.
  2. Good results compared with recent methods in the domain of efficient inference.
  3. Showing that memoization can work for BERT-like models may open interesting avenues of research in the future.
  4. Detailed ablations to back the various architecture choices of the SkipBERT model.
Summary Of Weaknesses:

Weakness -

  1. The paper lacks a more qualitative understanding of the results. It would be good to understand what kinds of inputs BERT_12 finds easy to predict correctly while SkipBERT struggles, and vice versa.
  2. The ablations on the impact of memory on accuracy should be done on tougher tasks like CoLA or SQuAD to better understand these trade-offs.
  3. It would be good to see the impact of local+global attention on more complex tasks like coreference resolution or NER, which generally require longer bidirectional context.

Questions for the authors -

  1. Lines 349-350 - Why do you say "latency on modern GPUs is not sensitive to hidden size, but mainly depends on number of sequential operations"?
  2. Is the latency in Table 1 measured on a real GPU? (A hypothetical timing harness relevant to questions 1-2 is sketched after this list.)
  3. How is it that the latency reduction for SkipBERT is the same as for the other iso-MAC models in Table 1? Every iso-MAC model should show some variation in runtime based on the size of the model.
  4. At what sequence length does SkipBERT stop being attractive for inference, i.e., when does the cost of the lookups start to add up?
  5. What is "Aggregate Text Chunks" in Table 2? If it represents Equation 3, shouldn't it have MACs associated with it?
  6. In Table 6, SkipBERT has 6 layers of local context and 4 of global context. I find it hard to understand what L_loc = 8 means here. Skipping 8 layers? But there are only 4 global-context layers.
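To help probe questions 1-2 empirically, here is a hypothetical timing harness that measures per-forward latency for different depths and hidden sizes on a CUDA device. The shapes, head count, and repetition counts are arbitrary illustration choices (nhead must divide hidden_size).

```python
# Hypothetical GPU timing harness for comparing depth vs. width effects on latency.
import time
import torch

def time_encoder(num_layers, hidden_size, seq_len=128, reps=50, device="cuda"):
    layer = torch.nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8,
                                             batch_first=True).to(device)
    encoder = torch.nn.TransformerEncoder(layer, num_layers=num_layers).to(device).eval()
    x = torch.randn(1, seq_len, hidden_size, device=device)
    with torch.no_grad():
        for _ in range(5):                    # warm-up iterations
            encoder(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(reps):
            encoder(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps * 1000   # ms per forward pass

# e.g. compare a deep/narrow against a shallow/wide configuration:
# print(time_encoder(num_layers=12, hidden_size=768), time_encoder(num_layers=4, hidden_size=1536))
```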
Comments, Suggestions And Typos:
  1. Why is the SQuAD result only in the Appendix rather than discussed in the main paper? Given the complexity of the task, a detailed analysis of SQuAD should be part of the main paper.

  2. Line 16 - "performance. An using..." -> "performance and using"

Overall Assessment: 3.5
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.