SkipBERT: Efficient Inference with Shallow Layer Skipping (Blind Submission, November cycle)
Meta Review of Paper714 by Area Chair Xd3p
The submission improves the efficiency of a BERT model by learning context-independent representations for chunks of text, and caching these representations. The model is trained by distilling from BERT. When combined with an early exit mechanism, the approach is shown to dramatically improve inference speed without losing performance.
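To make the training setup concrete, the sketch below shows a generic knowledge-distillation objective (soft-label KL plus a hidden-state MSE term). This is an assumed, minimal illustration only: the paper's actual two-stage recipe, loss terms, temperatures, and weights may differ, and all names here are hypothetical.

```python
# Generic knowledge-distillation objective; a minimal sketch, not the
# paper's exact two-stage recipe. T and alpha are illustrative weights.
import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max()) / T)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits,
                 student_hidden, teacher_hidden,
                 T=2.0, alpha=0.5):
    """KL between teacher and student soft labels, plus an MSE term
    aligning intermediate hidden states (both are common choices)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return alpha * kl + (1 - alpha) * mse
```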
The paper tackles a practically important problem with an interesting and novel method.
Empirical results are strong, with a 65% speedup at no cost in accuracy. Extensive ablations and analysis are included.
The reviewers agree that the paper is well written.
The paper has been significantly improved since an earlier submission.
Reviewers make several suggestions for additional tasks (particularly more challenging ones), architectures (such as GPT or BART), and analysis that could further improve the paper. Nevertheless, the paper is ready for publication in its current form.
Official Review of Paper714 by Reviewer hEfZ
SkipBERT: Efficient Inference with Shallow Layer Skipping
The paper presents an idea for efficient inference that replaces computation in the lower layers with precomputed and stored representations of text chunks, kept in a precomputed lookup table (PLOT). The paper also incorporates early exit mechanisms as an enhancement to skip redundant computation in later layers, and reports extensive evaluations on GLUE tasks.
Extensive experiments on the GLUE benchmark compare against DistilBERT, BERT-PKD, and TinyBERT, showing that SkipBERT can accelerate inference by up to 65% without compromising GLUE score, or by 82% while retaining 95% accuracy. The evaluation also discusses accuracy-versus-latency trade-offs in detail.
The system can become brittle and introduces flaws similar to those of n-gram-based language models. The accuracy comparison with the 6- and 4-layer TinyBERT models also does not seem equivalent, since the current SkipBERT model has almost twice the number of parameters.
Overall, the paper presents thorough analysis and experiments using PLOT, and an interesting direction for improving latency with some sacrifice in accuracy. It would be good to see how the approach extends to GPT- or BART-style architectures, although that is beyond the scope of this paper. Additionally, it would be useful to see the impact on zero-shot settings and on more complex tasks such as NER and summarization, from both a latency and an accuracy standpoint.
Official Review of Paper714 by Reviewer vC4P
This paper proposes an interesting approach for accelerating BERT inference. Specifically, the method precomputes contextualized token representations of shallow BERT layers. The representations are stored in a lookup table, and as a result, the model can simply obtain the outputs of shallow layers via ngram lookup. To make this approach possible, the shallow layers are restricted to take short chunks (n-gram) of local context as input for each token. This simplified BERT model (SkipBERT) is distilled from standard BERT using a two-stage distillation process. The proposed method achieves strong results on the GLUE benchmark, outperforming previous BERT distillation works.
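For concreteness, here is a minimal sketch of the lookup-based shallow layers described above. It is not the authors' implementation: the dictionary-based table, the tri-gram window, the cache-miss fallback, and all names (`plot_table`, `shallow_encoder`, `shallow_outputs`) are illustrative assumptions.

```python
# Minimal sketch of a PLOT-style lookup replacing the shallow layers.
# Everything here (plot_table, shallow_encoder, the tri-gram window,
# the cache-miss fallback) is illustrative, not the authors' code.
import numpy as np

HIDDEN = 768  # hypothetical hidden size

def shallow_encoder(ngram_ids):
    """Stand-in for the local (shallow) transformer layers: maps an
    n-gram of token ids to one locally contextualized vector per token."""
    rng = np.random.default_rng(sum(ngram_ids))  # deterministic toy output
    return rng.standard_normal((len(ngram_ids), HIDDEN))

# Offline: precompute representations for frequent tri-grams.
corpus_trigrams = [(101, 2023, 2003), (2023, 2003, 1037), (2003, 1037, 3231)]
plot_table = {tri: shallow_encoder(tri) for tri in corpus_trigrams}

def shallow_outputs(token_ids, n=3):
    """Online: replace shallow-layer computation with table lookups.
    Each token takes the representation computed for the n-gram centred
    on it; n-grams missing from the table are encoded on the fly."""
    pad = n // 2
    padded = [0] * pad + list(token_ids) + [0] * pad
    states = []
    for i in range(len(token_ids)):
        ngram = tuple(padded[i:i + n])
        hidden = plot_table.get(ngram)
        if hidden is None:                  # cache miss: compute locally
            hidden = shallow_encoder(ngram)
        states.append(hidden[pad])          # keep the centre token's state
    return np.stack(states)                 # (seq_len, HIDDEN), fed to the deep layers

print(shallow_outputs([101, 2023, 2003, 1037, 3231]).shape)  # (5, 768)
```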
Augmenting neural networks with external memory such as lookup tables is an interesting and important direction. This paper presents a novel idea for building an efficient student model via representation lookup, and shows promising results on how it can be combined with distillation (and perhaps other compression techniques such as pruning) to achieve a stronger accuracy-latency trade-off.
The paper is very well written. The speed and memory analyses and the ablation studies are enriching.
This resubmission version has addressed my questions (in Section 4.7.1).
This is a strong paper and I'd like to see it accepted.
Official Review of Paper714 by Reviewer BCvc
This paper presents an efficient variant of BERT called SkipBERT that skips the lower layers by precomputing representations of small text chunks (e.g., tri-grams). It also uses early exits along with distillation to further improve inference efficiency. The evaluation results show a latency reduction of more than half while maintaining task performance (e.g., GLUE scores).
The precomputed n-grams idea is simple and can potentially benefit many applications. The authors show SkipBERT's effectiveness on many tasks.
The main evaluation results are comprehensive and show strong improvements over existing baselines. The ablation study is exhaustive and explains the SkipBERT design decisions well.
SkipBERT training requires large-scale pretraining from scratch. This cost is even larger considering that the distillation process also requires a pre-trained BERT. Is it possible to make SkipBERT work without this pretraining (e.g., by using the pre-trained BERT to help SkipBERT training)? Or how much does the pretraining help performance?
The storage costs are unclear. Table 3 only presents how the ratio of tri-grams replaced with bi-grams affects the space cost, not the storage costs in general.
Inference efficiency results for batching on GPUs are missing. The paper describes batch size 1 as a common scenario, which is true for end users, but for servers or large-scale deployments it would be better to also compare GPU batching speedups.
Does SkipBERT build tri-gram hidden-state lookup tables for each dataset and each task? Would a single table built on Wikipedia (or the training corpus) work? If SkipBERT is dataset-specific, that limits the application scenarios and adds storage costs. Adding a few sentences discussing this point would be useful.
The distillation in SkipBERT is a necessary component; this would be clearer if discussed upfront (e.g., in the introduction). Without distillation, SkipBERT would lose much of its performance.
Minor suggestion: adding early exits is orthogonal to this work; it is not a contribution but rather a benefit that could be shown in the evaluation section (instead of being highlighted in the abstract and introduction).
Related work comments: it is unclear how the related work in the model compression section relates to SkipBERT. The input-adaptive inference section essentially discusses only early exiting, without covering other input-adaptive works such as "Adaptive Input Representations for Neural Language Modeling" and "Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search".
Official Review of Paper714 by Reviewer fRBS
This paper proposes a memoization-based technique to improve the inference runtime of BERT for downstream applications. It proposes replacing the lower layers of BERT with local transformer layers that operate on n-grams of text rather than the complete sentence. These lower layers build intermediate representations, which are then fed to normal transformer layers that attend to all tokens instead of n-grams. This allows for two advantages -
- The outputs for n-grams can be precomputed, trading larger memory for faster compute: the precomputed n-gram representations can be retrieved with a simple lookup.
- Methods like early exit combine well with this approach. Early exit tries to reduce computation by reducing the number of layers executed. Standard transformers operate on non-contextualized word embeddings and therefore require a few layers to be executed before early exit becomes attractive, since these layers provide the contextualized representations necessary to complete the task. By providing contextualized word embeddings through a lookup table, SkipBERT can make early exit attractive after far fewer layers than a standard architecture (see the sketch after this list).
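The sketch below illustrates the second advantage: entropy-based early exit running on top of contextualized states retrieved from a lookup table. It is an assumed illustration rather than the paper's code; the threshold, the internal classifier heads, and the toy random "layers" in the usage example are all hypothetical.

```python
# Minimal sketch of entropy-based early exit on top of lookup-provided
# contextualized states. The threshold, the internal classifier heads,
# and the random "layers" below are all hypothetical stand-ins.
import numpy as np

HIDDEN, NUM_CLASSES = 768, 2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def classify_with_early_exit(hidden, deep_layers, exit_heads, threshold=0.3):
    """hidden: (seq_len, HIDDEN) states retrieved from the n-gram table.
    After each deep (full-attention) layer, an internal classifier makes a
    prediction; if its entropy is below the threshold, we exit early and
    skip the remaining layers."""
    probs = None
    for layer, head in zip(deep_layers, exit_heads):
        hidden = layer(hidden)                      # one full-attention layer
        probs = softmax(head(hidden.mean(axis=0)))  # pooled internal prediction
        if entropy(probs) < threshold:              # confident enough -> exit
            break
    return int(probs.argmax()), probs

# Toy usage with random linear maps standing in for the real layers/heads.
rng = np.random.default_rng(0)
deep_layers = [(lambda h, W=rng.standard_normal((HIDDEN, HIDDEN)) / HIDDEN ** 0.5: h @ W)
               for _ in range(4)]
exit_heads = [(lambda v, W=rng.standard_normal((HIDDEN, NUM_CLASSES)): v @ W)
              for _ in range(4)]
label, probs = classify_with_early_exit(rng.standard_normal((5, HIDDEN)), deep_layers, exit_heads)
print(label, probs)
```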
- Detailed breakdown of the systems aspects of this method: memory used, latency of each part of the model, and trade-offs between memory and accuracy
- Good results when compared with recent methods in the domain of efficient inference
- Showing that memoization can work for BERT-like models can open interesting avenues of research in the future
- Detailed ablations to back various architecture choices for the SkipBERT model
Weaknesses -
- The paper lacks a more qualitative understanding of the results. It would be good to understand what kinds of inputs BERT_12 finds easy to predict correctly but SkipBERT struggles with, and vice versa.
- The ablations on the impact of memory on accuracy should be done on tougher tasks like CoLA or SQuAD to better understand these trade-offs
- It would be good to see the impact of local+global attention on more complex tasks like coreference resolution or NER, which generally require longer bidirectional context
Questions for the authors -
- Lines 349-350 - Why do you say "latency on modern GPUs is not sensitive to hidden size, but mainly depends on number of sequential operations"?
- Is the latency in Table 1 calculated on a real GPU?
- How is it that the latency reduction for SkipBERT is the same as for the other iso-MAC models in Table 1? Every iso-MAC model should show some variation in runtime based on the size of the model.
- At what inference sequence length does SkipBERT stop being attractive, i.e., when does the cost of lookups start to add up?
- What is "aggregate Text Chunks" in Table 2? If it represents Equation 3, shouldn't it have MACs associated with it?
- In Table 6, SkipBERT has 6 layers of local context and 4 of global context. I find it hard to understand what L_loc=8 means here. Skipping 8 layers? But there are only 4 global-context layers.
Why is the SQuAD result only in the appendix rather than discussed in the main paper? Given the complexity of the task, a detailed analysis of SQuAD should be part of the main paper.
Line 16 - "performance. An using..." -> "performance and using"
Supplementary Materials by Program Chairs