UAI (Accept)

Reviewer #1 Questions

1. Summary and contributions: Please summarize the paper's motivation and key contributions in a few lines. Please see https://www.auai.org/uai2021/reviewing_instructions#Q1 for more information.
The authors propose a new way to stabilize signal propagation through a deep network. They suggest initializing the weight on the residual connection to zero and making it trainable afterwards.

2. Main strengths: Please describe the main strengths of the work, considering primarily the following axes: originality/novelty, significance/impact, soundness/technical quality, reproducibility and clarity of writing. Please mention as many main strengths as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q2
- Empirical results on several models, such as CNNs, fully-connected networks, and transformers, show improved convergence behavior.

3. Main weaknesses: Please describe the main weaknesses of the work, considering the same axes as in Q2: novelty/originality, impact/significance, soundness/technical quality, reproducibility, and clarity of writing. Please mention as many main weaknesses as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q3
- The method is not well justified. There is only a vague description on a toy example. See the detailed comments below.
- It is not clear to me where the improvements come from. See the comment "Decoupling initialization and trainable parameter" below.

4. Detailed comments to the authors: Please provide constructive criticism and feedback that could help improve the work or its presentation (e.g., presentation suggestions, missing references, minor mistakes and typos or grammar improvements). You may also include questions to the author here. Please see https://www.auai.org/uai2021/reviewing_instructions#Q4 for more information.
- Justification: The authors justify their approach on a very simple toy example (Section 3.1). I find the description lacking in detail and unclear. At the very least, I think it is important that this part be improved. One could perhaps also try to develop a more concrete theory. The authors could, for instance, check this paper, which also relies on a zero initialization for a ResNet model (linear network): https://openreview.net/forum?id=HJxEhREKDH
- "assuming a sufficiently well-conditioned cost function, the first step of gradient descent will update the residual weights α to a value that avoids large outputs and keeps the parameter trajectory within a well-conditioned region while retaining the expressive power of the network." This description is very imprecise: what do you mean by well-conditioned enough? Why would the trajectory stay in this well-conditioned region?
- Algorithm box: Please add a complete description of the algorithm. There is a description in Section 3.1, but some details are missing (e.g., the choice of constants), and it is for a 1d example.
- Optimizer: In Figure 5, you say you are training with Adagrad. You also mention SGD with momentum in the appendix. It would be more consistent to use the same optimizer for all your experiments.
- Decoupling initialization and trainable parameter: From the experiments, it seems unclear whether the gains come from the zero initialization or from making alpha trainable. The authors do compare to a method called "ZERO gamma", which is said to learn gamma.
Is gamma playing the same role as alpha (equation 8 in your paper)? If so, does this mean that most of the gains come from the zero initialization?
- "we discovered interesting patterns in the values of residual weights of each layer |α_i| over the course of training" This is actually one aspect I wish had been explored in the paper. The method does seem to bring some improvements, but it is not completely clear where they come from (see also the comment above).

5. Overall score: Please select the category that best describes your overall assessment of the paper. Please see https://www.auai.org/uai2021/reviewing_instructions#Q5 for more information.
Weak Reject: Borderline paper, tending to reject

6. Justification for your score: Please explain in a few lines how you arrived at your overall assessment. Which aspects mentioned under main strengths and main weaknesses (Q2 & Q3) did you weigh most heavily and why?
The experimental results are run on various models (CNNs, fully-connected, transformers, ...) and do show some improvements. However, I find that some aspects of the paper are unclear about the implementation, and it is not clear to me where the improvements come from. I will of course consider raising my score if my concerns are addressed in the rebuttal.

7. Confidence in your score: Please rate your confidence in your assessment.
You are fairly confident that the evaluation is correct.

Reviewer #2 Questions

1. Summary and contributions: Please summarize the paper's motivation and key contributions in a few lines. Please see https://www.auai.org/uai2021/reviewing_instructions#Q1 for more information.
This paper is fundamental research on building deep neural networks with faster convergence speed. The paper proposes ReZero, a simple modification of deep neural networks that ensures initial dynamical isometry (i.e., all singular values of the Jacobian concentrate near 1), of the form $x_{i+1} = x_i + \alpha_i F(x_i)$ with $\alpha_i = 0$ at the beginning of training. Being architecture agnostic (applicable to many representative networks), enabling deeper learning, and achieving faster convergence are the major contributions.
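As a back-of-the-envelope check of the "initial dynamical isometry" property summarized above (this is not an equation quoted from the paper under review): for a block $x_{i+1} = x_i + \alpha_i F(x_i)$, the input-output Jacobian is $\partial x_{i+1} / \partial x_i = I + \alpha_i \, \partial F(x_i) / \partial x_i$, which reduces to the identity at $\alpha_i = 0$, so every singular value equals 1 at initialization regardless of how $F$ itself is initialized.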
2. Main strengths: Please describe the main strengths of the work, considering primarily the following axes: originality/novelty, significance/impact, soundness/technical quality, reproducibility and clarity of writing. Please mention as many main strengths as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q2
Main strengths:
1. Clear motivation, illustration of the novel idea in ReZero, and clear comparison with baseline methods such as residual connections, normalization methods, and SKIPINIT (one of the closest related works);
2. Clear and nice contributions of ReZero, in terms of being network-architecture agnostic, supporting much deeper learning, and ensuring faster convergence, bringing an insightful impact to the field of deep learning (both CV and NLP and other application fields);
3. Detailed theoretical and experimental analysis to support the proposal.

3. Main weaknesses: Please describe the main weaknesses of the work, considering the same axes as in Q2: novelty/originality, impact/significance, soundness/technical quality, reproducibility, and clarity of writing. Please mention as many main weaknesses as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q3
Main weaknesses:
1. Table 5's results on CIFAR-10 are fine, but I would also like to see end-to-end improvements on large-scale application tasks, such as ImageNet and the SuperGLUE tasks.

4. Detailed comments to the authors: Please provide constructive criticism and feedback that could help improve the work or its presentation (e.g., presentation suggestions, missing references, minor mistakes and typos or grammar improvements). You may also include questions to the author here. Please see https://www.auai.org/uai2021/reviewing_instructions#Q4 for more information.
Detailed comments and questions:
1. Table 3: I do not quite understand how significant the change from the 64-layer model's 1.11 to the 128-layer model's 1.08 is. The Transformer is actually an encoder-decoder structure, so do you mean only the "encoder" part of the Transformer here? (I am wondering why there are only 101M parameters, which seems too small for a 128-layer Transformer; if you check the Transformer paper, you will see that even a 6-layer Transformer has a much larger number of parameters.)
2. [Citing formats] "256 GPUs [Microsoft]": better to give a year here for their Turing-LM. Also, several places have similar citation issues: in the Abstract, "(Pennington et al.)" should include a year. In Section 2.1, "(Pennigton et al. [Pennington et al., 2017, 2018])" seems to be a typo, and LaTeX has ways to handle citations (e.g., try \citep{}). In addition, in Section 4.1, "in detail Yang and Schoenholz [2017]", possibly a \cite{} is better here. In D.5 FixUp, "He et al. He et al. [2015]" can be updated, and likewise for most of the citations in D.1. An overall check of your citations would be advisable.
3. Figure 1: to align with your equation (1), does the alpha in the figure start from alpha_0 and end at alpha_{L-1}?
4. Figure 4 is great for understanding the strength of ReZero in dealing with dynamical isometry. How do the other methods perform here, such as SkipInit and char-TX?
5. D.7, "the rule to implement ReZero is": is there only one rule here? I only see "1."; I am just wondering whether there is a "2./3./…".

5. Overall score: Please select the category that best describes your overall assessment of the paper. Please see https://www.auai.org/uai2021/reviewing_instructions#Q5 for more information.
Strong accept: Outstanding paper

6. Justification for your score: Please explain in a few lines how you arrived at your overall assessment. Which aspects mentioned under main strengths and main weaknesses (Q2 & Q3) did you weigh most heavily and why?
This paper is fundamental research on building deep neural networks with faster convergence speed, with clear motivation, expression, illustration and experimental results. ReZero is applicable to many representative networks such as ResNet and the Transformer, and this will very likely have a huge impact on the research field.

7. Confidence in your score: Please rate your confidence in your assessment.
You are confident but not absolutely certain that the evaluation is correct.

Reviewer #4 Questions

1. Summary and contributions: Please summarize the paper's motivation and key contributions in a few lines. Please see https://www.auai.org/uai2021/reviewing_instructions#Q1 for more information.
This paper introduces a simple yet effective technique, ReZero: by adding a learnable coefficient to the residual module of a ResNet, or by making a slight modification to the Transformer, it can largely accelerate convergence during training.
2. Main strengths: Please describe the main strengths of the work, considering primarily the following axes: originality/novelty, significance/impact, soundness/technical quality, reproducibility and clarity of writing. Please mention as many main strengths as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q2
This paper is well-organized; the idea is very straightforward and easy to understand. The empirical results are impressive.

3. Main weaknesses: Please describe the main weaknesses of the work, considering the same axes as in Q2: novelty/originality, impact/significance, soundness/technical quality, reproducibility, and clarity of writing. Please mention as many main weaknesses as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q3
In Table 2 and Table 5, the fact that the zero gamma baseline performs worse than the vanilla baseline seems counterintuitive to me.

4. Detailed comments to the authors: Please provide constructive criticism and feedback that could help improve the work or its presentation (e.g., presentation suggestions, missing references, minor mistakes and typos or grammar improvements). You may also include questions to the author here. Please see https://www.auai.org/uai2021/reviewing_instructions#Q4 for more information.
I do not think the title is appropriate, as all this paper presents is a technique for faster convergence. The current title is a bit of an overclaim.

5. Overall score: Please select the category that best describes your overall assessment of the paper. Please see https://www.auai.org/uai2021/reviewing_instructions#Q5 for more information.
Weak accept: Borderline paper, tending to accept

6. Justification for your score: Please explain in a few lines how you arrived at your overall assessment. Which aspects mentioned under main strengths and main weaknesses (Q2 & Q3) did you weigh most heavily and why?
Overall I think this paper presents an interesting and useful technique which can be widely used despite its simplicity, so my rating is weak accept.

7. Confidence in your score: Please rate your confidence in your assessment.
You are confident but not absolutely certain that the evaluation is correct.

10. Considerations after author rebuttal and discussion period:
After reading the authors' response and the other reviewers' comments, I choose to keep my current rating of weak accept.

Reviewer #5 Questions

1. Summary and contributions: Please summarize the paper's motivation and key contributions in a few lines. Please see https://www.auai.org/uai2021/reviewing_instructions#Q1 for more information.
In this work, the authors introduce the ReZero technique and show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches.

2. Main strengths: Please describe the main strengths of the work, considering primarily the following axes: originality/novelty, significance/impact, soundness/technical quality, reproducibility and clarity of writing. Please mention as many main strengths as there are, but avoid minor points here.
For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q2
This paper introduces ReZero, a simple architectural modification that facilitates signal propagation in deep networks and helps the network maintain dynamical isometry, leading to an improvement in convergence speed.

3. Main weaknesses: Please describe the main weaknesses of the work, considering the same axes as in Q2: novelty/originality, impact/significance, soundness/technical quality, reproducibility, and clarity of writing. Please mention as many main weaknesses as there are, but avoid minor points here. For more information about what we are looking for, please see https://www.auai.org/uai2021/reviewing_instructions#Q3
There is room for improvement regarding the writing, particularly in the Introduction section.

4. Detailed comments to the authors: Please provide constructive criticism and feedback that could help improve the work or its presentation (e.g., presentation suggestions, missing references, minor mistakes and typos or grammar improvements). You may also include questions to the author here. Please see https://www.auai.org/uai2021/reviewing_instructions#Q4 for more information.
In the Introduction, the explanation of ReZero currently appears in the same paragraph where the authors discuss 'recent theoretical work'. It would be better to move it into a new paragraph so that the reader can clearly identify where the authors' proposal starts, and it might help to highlight the 'contributions' of this work instead of only the 'benefits'. Furthermore, at the end of the same section, adding a summary of the following sections would give the reader an overview of the research work.
In Table 1, it is not clear what the various forms of normalization and residual connections are. It would be better to reword the caption to match the explanation given about this table in the body, and to add an explanation in the body of the text about the similarities and differences with respect to ReZero. Equation (4) is different from the formula shown in Table 1; please justify this difference.
Figure 3 is not properly explained, either in the caption or in the body of the text.
In the Conclusions section, the authors could explain how they meet the contributions or 'benefits' stated in the Introduction section.

5. Overall score: Please select the category that best describes your overall assessment of the paper. Please see https://www.auai.org/uai2021/reviewing_instructions#Q5 for more information.
Weak accept: Borderline paper, tending to accept

6. Justification for your score: Please explain in a few lines how you arrived at your overall assessment. Which aspects mentioned under main strengths and main weaknesses (Q2 & Q3) did you weigh most heavily and why?
This paper introduces a simple architectural modification for deep networks, which results in an improvement in convergence speed; there are some details in the writing that can be corrected with a minor revision.

7. Confidence in your score: Please rate your confidence in your assessment.
You are confident but not absolutely certain that the evaluation is correct.

==================================================
AAAI 2021 (reject)
==================================================

Reviewer #1 Questions

1. {Summary} Please summarize the main claims/contributions of the paper in your own words.
(Do not provide any review in this box)
This is an empirical work, whereby the authors introduce a variant of the residual network architecture block and perform a thorough experimental campaign to evaluate the benefits of the proposed architecture, as compared to a large number of alternatives. The general take-home message is that the proposed method is: 1) easy to implement, 2) complementary to alternative techniques such as batch/layer normalization, and 3) generally superior in terms of convergence times.

2. {Novelty} How novel is the paper?
Paper contributes some new ideas

3. {Soundness} Is the paper technically sound?
The paper has minor technical flaws that are easily fixable

4. {Impact} How important is the paper likely to be, considering both methodological contributions and impact on application areas?
The paper will have a broad and significant impact

5. {Clarity} Is the paper well-organized and clearly written?
Excellent: paper is well organized and clearly written

6. {Evaluation} Are claims well supported by experimental results?
Good: Experimental results are sufficient, though more analysis would significantly add support to the claims

7. {Resources} How impactful will this work be via sharing datasets, code and/or other resources? (It may help to consult the paper's reproducibility checklist.)
Not applicable: no shared resources

8. (Reproducibility) Would the experiments in the paper be easy to reproduce? (It may help to consult the paper's reproducibility checklist.)
Meets Minimum Standard: e.g., code/data unavailable, but paper is clear enough that an expert could confidently reproduce

9. {Reasons to Accept} Please describe the paper's key strengths.
1) A very simple technique that is easy to implement and integrate with existing deep network architectures
2) The proposed architectural block allows training of very deep models, which could be a competitive advantage in some applications requiring high expressivity/flexibility of the model
3) High potential for providing increased training performance

10. {Reasons to Reject} Please describe the paper's key weaknesses.
1) The theoretical aspects underpinning the proposed method are only superficially addressed in this paper. I could not find a solid theoretical formulation to justify the connection with recent theoretical work on dynamical isometry, as claimed in the paper.
2) The intuitive description of why the method works, which could explain why the experimental results are extremely favorable, is only partially solid. The intuition is easy to grasp, but the mathematical details are somewhat elusive. An additional intuition about the role of the $\alpha_i$ trainable parameters is given at the end of the paper, but it appears to be valid only for transformer architectures, and it is not clear whether it holds in general.
3) The experimental results are compelling, but in some cases the claim of faster training is not fully supported by the results.

11. {Detailed Comments} Please provide other detailed comments and constructive feedback.
- Section 1: the last item on the contributions of this work claims a speedup for ResNet on CIFAR-10 that does not seem to be reported in the results section. Whereas it is clear that the proposed method requires fewer iterations to converge for the transformer architecture, this is not clear for the ResNet architecture.
- Section 2: the network width is labelled $w$, which is the same notation used in Section 3 to indicate the weights in the toy example. This should be addressed, e.g.
by not introducing the notation for the width, which is not fundamental to the paper.
- Section 2.1: the last paragraph indicates that ReLU activations imply that dynamical isometry cannot be satisfied. Yet, in the experiments the authors use ReLU activations, while claiming at the same time that their method *ensures* dynamical isometry. While this can be true if we carefully ponder the adjective "initially", which when $\alpha_i = 0$ implies an identity mapping, I think this aspect should be clarified in the paper.
- Section 3: it would be very useful to be a bit more precise about the learnable parameters $\alpha_i$ and their nature. These are intended to be scalars, which multiply element-wise the vector output of the function $F_i$.
- Section 3: eq. 4 should receive more attention, e.g., to describe the situation in which we would like $x_{i+1}$ to have different dimensions than $x_i$.
- Section 3.1: I think for the toy model it is important to clarify that the assumption is that there are no non-linearities, that is, we are assuming a deep linear model.
- Section 3.1: eq. 5 could benefit from being expressed as the overall parametric functional mapping from input to output, that is, expressing it as $F(x_0, w)$. This would help in defining the cost function considered in the toy example, for example as $C(x, \alpha, w, y) = \frac{1}{2} \sum_{i=1}^{N} C(x^{(i)}, \alpha, w, y^{(i)})$ if we consider $N$ i.i.d. training samples. If we want to simplify, we can consider a single training sample, get rid of the sum, and simplify the notation.
- Sec 3.1: eq. 6 is problematic in my humble opinion. This is the GD update rule, corresponding to minimizing the cost function w.r.t. the parameters. Hence, the gradient should be taken w.r.t. $w$, and not $x$. Also, by appropriately defining a cost function, it would be much clearer how to derive the terms after the learning rate $\lambda$. Additionally, the gradient term of the cost w.r.t. $x$ is hard to justify.
- Sec 3.1: eq. 7 gives a good intuition of the dependence of the learning rate on the depth of the architecture. However, since $\alpha$ is also a trainable parameter, it would be fair to write the GD update rule for learning it, which involves computing the gradient of the cost function w.r.t. $\alpha$. Then, an expression similar to eq. 6 would be obtained. Note also that, as advocated by the authors, initializing $\alpha=0$ implies that in eq. 6 there are no updates to the weight $w$ until $\alpha$ becomes different from zero. As such, one would expect $w$ to change little in the first iterations of GD, and therefore the training loss decay should not be so sharp. Finally, by expressing the GD update rule for $\alpha$, it would become clear that the learning rate used to learn it could be different from the one used in the GD update rule for $w$.
- Section 4: the results on the training loss going to zero are quite impressive. It would be interesting to know how the process of learning $\alpha$ evolves throughout the iterations, and what the impact of $\alpha$ is on the "distance" of the solution (the weight iterates) w.r.t. the initialization. Additionally, whereas it is clear that convergence speed here is defined as the number of iterations required to reach zero training loss, it would be interesting to define another performance metric based on test loss or validation/test accuracy, e.g., how many steps are needed to reach a given value.
- Section 5: the claim is that ReZero accelerates training and improves accuracies.
However, this claim is not supported by Table 2. There we see that Gated ResNet is generally faster to reach 80% accuracy. The claim does hold for pre-activation ResNet-18 and ResNet-50, but these are hard to compare to the previous results. The improvement in accuracy of ReZero seems to be linked to higher depth, and for the pre-activation experiments the improvement is clear. Finally, the acceleration in training is claimed when superconvergence is used, but here ReZero is only compared to itself. => Overall, I think the comments on the results should be clarified: whereas I believe that experimentally ReZero is indeed superior to competitors, the table is not so clear, and we do not find the claim from the introduction about the speed to reach 85% accuracy.
- Section 6: here the results are very compelling and impressive. It is a pity that the idea of observing the histograms of singular values (would it have been simpler to look at the eigenspectrum instead?) was not applied consistently across all architectures studied in the paper. Moreover, the study of the evolution of $|\alpha_i|$ across training would also have been interesting for the other architectures. Why are only results for the 64-layer ReZero transformer presented?

12. {QUESTIONS FOR THE AUTHORS} Please provide questions for authors to address during the author feedback period. (Please number them)
I think the paper presents a very neat idea, and the experimental campaign is done diligently. Here are the questions I would address:
- How can you justify more thoroughly the relation to the theoretical foundations of the proposed architectural block? The idea of a toy example is very good, but I think it should be developed more, and the equations should be double-checked (hopefully, my doubts are not only a matter of notation different from what I am used to).
- Some of the claims for the superiority of ReZero w.r.t. architectures such as MLP and ResNet configurations can be improved, in terms of the description of the results, of a proper and consistent definition of a performance metric, and of the depth of the study of the learnable parameters $\alpha$ as well as the singular value histograms, as done for the transformer network.

===== After rebuttal and discussions =====
Thanks to the authors for their feedback. The main contribution of this work is empirical, and the results presented are quite compelling, suggesting that the proposed method is indeed a valuable contribution. The methodological aspects of this work need to be strengthened, and some clarifications about the behavior of the newly introduced parameter $\alpha$ are needed. Given the discussions and the authors' feedback, I will keep my score.

13. {Ethical Considerations} Please highlight any ethical considerations that must be considered before a final decision (it may help to consult the paper's ethics statement, if provided)
Not applicable

14. (OVERALL SCORE)
6 - Above threshold of acceptance

19. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted

Reviewer #2 Questions

1. {Summary} Please summarize the main claims/contributions of the paper in your own words. (Do not provide any review in this box)
This paper proposes to introduce a weight, initialized to zero, for the sub-layer wrapped by the residual connection, which leads to faster convergence according to the experiments.

2. {Novelty} How novel is the paper?
Main ideas of the paper are known or incremental advances over past work
3. {Soundness} Is the paper technically sound?
The paper has major technical flaws, necessitating another review after corrections

4. {Impact} How important is the paper likely to be, considering both methodological contributions and impact on application areas?
The paper will impact a moderate number of researchers

5. {Clarity} Is the paper well-organized and clearly written?
Good: paper is well organized but language can be improved

6. {Evaluation} Are claims well supported by experimental results?
Moderate: Experimental results are weak: important baselines are missing, or improvements are not significant

7. {Resources} How impactful will this work be via sharing datasets, code and/or other resources? (It may help to consult the paper's reproducibility checklist.)
Not applicable: no shared resources

8. (Reproducibility) Would the experiments in the paper be easy to reproduce? (It may help to consult the paper's reproducibility checklist.)
Good: e.g., code/data available, but some details of experimental settings are missing/unclear

9. {Reasons to Accept} Please describe the paper's key strengths.
This paper introduces a weight vector, initialized to zero, for the sub-layer wrapped by the residual connection, and shows that this can help accelerate convergence.

10. {Reasons to Reject} Please describe the paper's key weaknesses.
There are several important issues that are not addressed. The post-convergence performance is not provided for many of the NLP experiments; will the approach outperform or underperform the other approaches after training? Initializing the gating vector with zeros seems to run counter to the motivation: deep networks are used to model complicated functions, not the identity function. Zero initialization also prevents gradients from reaching the wrapped sub-layer during back-propagation; will these sub-layers function properly and contribute to the overall performance? (See the gradient sketch below.) Some papers show that the convergence issue of deep Transformers is due to the layer normalization following the residual connection, and the ReZero approach removes layer normalization at the same time; will the approach work well with layer normalization? Or can simply removing the layer normalization, without ReZero, effectively ensure the convergence of deep Transformers?

11. {Detailed Comments} Please provide other detailed comments and constructive feedback.
Please provide the evaluation performance for the NLP tasks. For deep Transformers, since your approach also removes the layer normalization, please compare with corresponding deep Transformers without layer normalization.

12. {QUESTIONS FOR THE AUTHORS} Please provide questions for authors to address during the author feedback period. (Please number them)
1. Will the approach outperform or underperform the other approaches after training on NLP tasks?
2. Will these sub-layers function properly and contribute to the overall performance with the ReZero approach?
3. Will the approach work well with layer normalization? Or can simply removing the layer normalization, without ReZero, effectively ensure the convergence of deep Transformers?

--- After author response ---
Many thanks for the author response. I feel that issue 3 is well addressed and issue 2 is partially addressed. But I feel that the paper may need another review, as the final performance after training is not reported for some NLP tasks, and the concern that this algorithm may tend to move toward $\alpha=0$ very fast, which may reduce the sub-layer to an identity mapping, is not convincingly addressed.
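As a short gradient check relevant to the back-propagation and $\alpha \to 0$ concerns raised above (a back-of-the-envelope sketch under the ReZero update, not a derivation from the paper): for a single block $x_{i+1} = x_i + \alpha\, F_W(x_i)$ with cost $C$, we have $\partial C / \partial \alpha = \nabla_{x_{i+1}} C \cdot F_W(x_i)$, while $\partial C / \partial W = \alpha \, \nabla_{x_{i+1}} C \cdot \partial F_W(x_i) / \partial W$. At initialization ($\alpha = 0$) the sub-layer weights $W$ receive no gradient, but $\alpha$ itself generally does, so after the first update $\alpha \neq 0$ and gradients begin to flow into $F_W$. Whether $\alpha$ later drifts back toward zero, reducing the block to a near-identity mapping, remains the empirical question raised above.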
13. {Ethical Considerations} Please highlight any ethical considerations that must be considered before a final decision (it may help to consult the paper's ethics statement, if provided)
N/A

14. (OVERALL SCORE)
5 - Below threshold of acceptance

19. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted

Reviewer #3 Questions

1. {Summary} Please summarize the main claims/contributions of the paper in your own words. (Do not provide any review in this box)
This paper introduces a new structure called ReZero, similar to residual structures, in order to speed up the training of deep learning models and enable deeper models. Their structure has an identity mapping in addition to an adaptive scaling of the layer output that can replace the normalization part. They claim that initializing this scaling parameter at zero helps them go deeper in terms of layers without exploding or vanishing gradients. The experimental results provided on different tasks suggest that this structure might be useful in various cases.
======== After response =========
Thanks to the authors for providing responses. I think the idea is nice and seems to work in practice, but the intuition behind the approach is not that strong. However, since the experimental parts show promising results, I am keeping my score. My concerns about the final convergence rate of the algorithm compared to other algorithms, as well as the problem of $\alpha=0$, are not resolved. Specifically for the second part, it seems that this algorithm tends to move toward $\alpha=0$ very fast, which is highly concerning, given that this would reduce the entire network to an identity mapping. I think the authors need more discussion on this, as well as more experimental results on the values of $\alpha$, as suggested by R1.

2. {Novelty} How novel is the paper?
Paper contributes some new ideas

3. {Soundness} Is the paper technically sound?
I have not checked all details, but the paper appears to be technically sound

4. {Impact} How important is the paper likely to be, considering both methodological contributions and impact on application areas?
The paper will impact a moderate number of researchers

5. {Clarity} Is the paper well-organized and clearly written?
Good: paper is well organized but language can be improved

6. {Evaluation} Are claims well supported by experimental results?
Good: Experimental results are sufficient, though more analysis would significantly add support to the claims

7. {Resources} How impactful will this work be via sharing datasets, code and/or other resources? (It may help to consult the paper's reproducibility checklist.)
Not applicable: no shared resources

8. (Reproducibility) Would the experiments in the paper be easy to reproduce? (It may help to consult the paper's reproducibility checklist.)
Poor: e.g., experimental setup details are incomplete/unclear, and code/data are not available to aid reproducibility

9. {Reasons to Accept} Please describe the paper's key strengths.
The structure seems to converge faster on different tasks and would enable deeper models for those tasks. The experimental results show that this structure converges faster than the most closely related baselines. Using this structure, very deep models can be trained easily. The authors provide a functioning residual structure for a transformer model with a large number of layers.

10. {Reasons to Reject} Please describe the paper's key weaknesses.
As diverse as the experimental part is, since this part is the main contribution of the paper, and considering that this structure is competing with highly successful structures such as ResNets, it seems that the authors need to provide more rigorous results. For instance, in almost all cases they show the speed of convergence to the same level for different structures. However, it would be highly beneficial to compare the final level of convergence for the different algorithms, especially in underparametrized regimes. This could help us better understand the landscape of the solutions each algorithm converges to. Another concerning point is the value of the scaling parameters $\alpha$. It seems that most of them converge to a number close to zero, which is a bit surprising, because that makes the network an identity mapping. I think a more thorough investigation of the final values of the $\alpha$s is required to better understand this structure.

11. {Detailed Comments} Please provide other detailed comments and constructive feedback.
See section 10.

12. {QUESTIONS FOR THE AUTHORS} Please provide questions for authors to address during the author feedback period. (Please number them)
See section 10.

14. (OVERALL SCORE)
6 - Above threshold of acceptance

19. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted

==================================================
NeurIPS 2020 (reject)
==================================================

Reviewer #1 Questions

1. Summary and contributions: Briefly summarize the paper and its contributions.
In this paper, the authors propose a technique, ReZero, where the functional branch in a residual block is weighted by a learnable scalar which is initially set to zero. They illustrate that this achieves dynamical isometry at initialisation, and empirically compare networks utilising ReZero vs. those without for convnets on CIFAR-10 and transformers on enwik8.

2. Strengths: Describe the strengths of the work. Typical criteria include: soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the NeurIPS community.
This submission has several strengths. The proposed method is pleasingly simple, and well motivated with a toy example. There are experiments on both image and text problems, and the paper is very clear and well written.

3. Weaknesses: Explain the limitations of this work along the same axes as above.
The main weakness of this paper is that ReZero appears to be functionally identical to gated ResNets, except with alpha initialised as zero instead of one, gated ResNets being a simplification of Highway nets. Please let me know if I am mistaken on this point. Although ReZero is marketed very well in this paper, I feel that it is lacking in any real novelty beyond being a small change to a 3.5-year-old paper. The technique is advertised as a simple plug-in to make things faster, but Table 2 is concerning; why does ReZero make things worse on ResNet-56? Why aren't there comparisons to the other baselines for pre-activation ResNets? There is a lack of ablation studies, although Figure 6 is a good start. How does alpha vary for the image classification experiments?

4. Correctness: Are the claims and method correct? Is the empirical methodology correct?
The proposed technique is strongly linked to dynamical isometry in the narrative, where this is defined as having the singular values of the input-output Jacobian close to 1. While this is certainly true at initialisation (Figure 5), we can see that the spectrum quickly spreads out. From Figure 6 it looks like the alphas quickly increase. While connections to established theory are nice, I suspect that this is basically equivalent to a learnable version of warming up the learning rate as blocks come online. In a similar vein, as training proceeds and the alphas get smaller, this is similar to stochastic depth (https://arxiv.org/abs/1603.09382).

5. Clarity: Is the paper well written?
The paper is very well written, and clear.

6. Relation to prior work: Is it clearly discussed how this work differs from previous contributions?
As I have touched upon in the weaknesses, it is not made obvious that this is gated ResNets with a different initial alpha, and the authors should make this link stronger. The first time it is mentioned is in passing (line 191). While the supplementary material is more thorough on the matter, I think the connections between this, gated ResNets, and Highway nets should be made clearer in the background section.

7. Reproducibility: Are there enough details to reproduce the major results of this work?
Yes

8. Additional feedback, comments, suggestions for improvement and questions for the authors:
I have asked questions of the authors in other parts of this review. I liked this paper at first read, as it is well written and clear. However, I believe it is not sufficiently different from prior work to merit publication.
---- Post response ----
Thank you for providing a response. I have discussed this in depth with the other reviewers and I will keep my original score because there are serious concerns that have not been addressed. I will summarise these with the hope that they will be rectified in a future version of this paper:
- The connection to dynamical isometry is not convincing, and feels like a smokescreen. It only occurs near initialisation. I think that ReZero is mimicking the effect of something else, e.g. stochastic depth or learning rate warm-up, and I do not see any evidence that this is not the case.
- The authors did not answer any of my questions about conducting additional experiments: (i) Why does ReZero make things worse on ResNet-56? (ii) Why are there no comparisons to pre-activation ResNets?
- There is a lack of ablation on how alpha varies. Such ablations would be welcome for all experiments.

9. Please provide an "overall score" for this submission.
4: An okay submission, but not good enough; a reject.

10. Please provide a "confidence score" for your assessment of this submission.
4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

11. Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work?
Yes

Reviewer #2 Questions

1. Summary and contributions: Briefly summarize the paper and its contributions.
The authors propose a simple and effective method for improving the convergence speed of various deep feedforward networks. ReZero is a residual network where the non-residual part is initialized to be the identity multiplied by a scalar which is initially 0, i.e. $x \mapsto x + \alpha f(x)$ with, initially, $\alpha = 0$ and $f(x) = x$.
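For concreteness, the update described in this summary corresponds to something like the following minimal PyTorch-style sketch (my own illustration; the module and variable names are not taken from the paper, and the inner sub-layer is an arbitrary placeholder):

    import torch
    import torch.nn as nn

    class ReZeroBlock(nn.Module):
        """Residual block gated by a single trainable scalar initialized to zero."""
        def __init__(self, sublayer: nn.Module):
            super().__init__()
            self.sublayer = sublayer                    # e.g. an MLP, conv stack, or attention sub-layer
            self.alpha = nn.Parameter(torch.zeros(1))   # alpha = 0 makes the block the identity at init

        def forward(self, x):
            # x_{i+1} = x_i + alpha_i * F(x_i); no normalization layer is required
            return x + self.alpha * self.sublayer(x)

    # Usage sketch: stack many such blocks; at initialization the whole stack is an identity map.
    net = nn.Sequential(*[
        ReZeroBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
        for _ in range(32)
    ])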
The authors demonstrate convincing improvements for fully-connected, convolutional, and transformer networks with a large number of layers.

2. Strengths: Describe the strengths of the work. Typical criteria include: soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the NeurIPS community.
The method is super simple and the experiments are convincing. A toy example is given, which helps to build intuition.

3. Weaknesses: Explain the limitations of this work along the same axes as above.
The comparison would be more complete if the authors compared ReZero with Highway Networks instead of gated ResNets. Highway Networks that are biased to carry the information through unperturbed are identical to ReZero at initialization (see the sketch below). While ReZero initialises the non-residual part as the identity, that does not have a large effect at initialisation, since the branch is not used. In my opinion the authors did not provide convincing arguments for why the identity initialisation would be necessary, and I encourage the authors to add an ablation experiment. The impact of the method is surprising to me. I would have welcomed a deeper analysis regarding dynamical isometry instead of squandering space on describing neural network components that are very well known in the community.

4. Correctness: Are the claims and method correct? Is the empirical methodology correct?
I have read the paper carefully and have not found any mistakes.

5. Clarity: Is the paper well written?
The paper is well written and motivated, albeit a little shallow. I have one issue with Table 2: it is ambiguous. It is not clear whether the models are fully-connected or convolutional networks (though from experience I can tell those are convnets).

6. Relation to prior work: Is it clearly discussed how this work differs from previous contributions?
The authors go to great lengths to compare their work with related work.

7. Reproducibility: Are there enough details to reproduce the major results of this work?
Yes

8. Additional feedback, comments, suggestions for improvement and questions for the authors:
The submission doesn't provide code, but the implementation is trivial. My own initial experiments seem to confirm the authors' method. I'm confused about Table 2, since Gated ResNets seem to converge faster than ReZero (the opposite effect of Figure 3), which goes against the main claim of the paper. I don't feel the authors have addressed this issue adequately. They merely justify it with worse test-data performance; however, test error is not reported, only validation error (and it is worse by 0.2% and 0.8%, hardly a big gap). Convolutional Highway Networks have been shown to perform on par with convolutional ResNets on CIFAR, yet another reason to properly compare with Highway Networks, which are initially biased to carry information through layers in a similar way. Given the minor issues, I'm a little hesitant to fully endorse the paper. That said, I'm convinced the authors can make a convincing rebuttal for me to increase my score.
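To make the Highway comparison above concrete (my own sketch of the standard Highway formulation of Srivastava et al., not an equation from the paper under review): a Highway layer computes $y = T(x) \odot H(x) + (1 - T(x)) \odot x$ with gate $T(x) = \sigma(W_T x + b_T)$. With the carry-biased initialization $b_T \ll 0$, the gate $T(x) \approx 0$ and the layer approximately carries $x$ through, whereas a ReZero block $y = x + \alpha F(x)$ with $\alpha = 0$ is exactly the identity and introduces only a single scalar per block rather than a full gating sub-network.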
----- post rebuttal -----
I have reduced my score in light of the authors' rebuttal and the reviewer discussion. My main concerns are summarised here:
- For this sort of contribution, I think a more thorough analysis and more experiments can be expected (better metrics, more datasets, etc.).
- The method is introduced from the perspective of dynamical isometry, but unfortunately I failed to see the significance in the authors' arguments of why this connection is interesting and important.
- The main method is applied slightly differently in the different experimental settings. Those changes seem ad hoc and are not sufficiently discussed (e.g., switching alpha from 0 to 0.25, or switching the initialisation of the residual path from the identity to a standard neural network initialisation).
- The method shows some conflicting results and lacks the necessary comparison with Highway Networks, which are initially biased to skip the residual path.
That said, I find some of the observed benefits interesting and potentially useful. I hope the authors decide to improve their analysis and resubmit a revised manuscript.

9. Please provide an "overall score" for this submission.
5: Marginally below the acceptance threshold.

10. Please provide a "confidence score" for your assessment of this submission.
4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

11. Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work?
Yes

Reviewer #3 Questions

1. Summary and contributions: Briefly summarize the paper and its contributions.
This paper proposes a simple way to modify the design and initialization of residual networks that allows one to remove the normalization layers that are typically used and still obtain similar performance. The proposal is to modulate the learned non-residual path in each residual block with a trainable parameter alpha that is initialized to zero. It also increases convergence speed in the early phases of training, and in some cases can increase the overall convergence speed.
Post Rebuttal Update: Thanks to the authors for their efforts in composing the rebuttal, and their overall efforts on this paper. In the rebuttal, 2 out of 3 questions I asked were addressed satisfactorily, but the third question was not. The concerns related to initialization are partially about clarity: it should be explicitly stated under which initializations, for all types of layers, ReZero is effective, so that a reader can use it with their library of choice. The question is also relevant to the reason ReZero is effective. If dynamical isometry is the reason ReZero works, then is the initialization of layers immaterial, or less important? Answering these questions is important to make this paper clear and useful for readers. Finally, the weaknesses I mentioned in my initial review remain and echo the concerns of other reviewers. After the discussions, I have decided not to increase my score.

2. Strengths: Describe the strengths of the work. Typical criteria include: soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the NeurIPS community.
1. The main strength of this paper is that its proposal is extremely simple and does not appear to have any downsides based on the experimental results.
2. Removing the normalization layers in deep architectures is useful because it makes the networks simpler and saves computational resources.
If this can be done without compromising the held-out score (the normalization operations are believed to also provide regularization), then such a technique can be widely used by practitioners.

3. Weaknesses: Explain the limitations of this work along the same axes as above.
1. The empirical evaluation in the paper is somewhat limited. The paper essentially evaluates on two "real" datasets: CIFAR-10 (for CNNs) and enwik8. It is difficult to fully recommend the use of this technique as a default based on these results, but since the modifications are rather simple, they are easy to try out in practice.
2. The experimental results select an arbitrary threshold for measuring 'convergence speed', e.g., 80% accuracy on CIFAR-10 and 1.2 BPB on enwik8. The method speeds up training according to this metric, but this is misleading if the overall training time/iterations (to reach the final reported score) remain the same. This seems to be the case, since the paper only reports a speedup for reaching final scores for the single superconvergence result.
3. There are some important questions related to methodology that need answers (see next section).
4. See the section on relation to prior work.

4. Correctness: Are the claims and method correct? Is the empirical methodology correct?
Methodology questions:
1. Why was the initialization factor 0.25 chosen for the vanilla residual network (line 160)?
2. Appendix E states that some inconsistencies were observed in CIFAR-10 results using other PyTorch versions, presumably due to differences in default initializations. This is troubling: please state the inconsistencies observed and the exact initialization schemes used for these experiments. If a PyTorch default was used, please specify the default algorithms for the initialization of all weights and biases. What do the comparisons look like with different initialization schemes, e.g., that of He et al.?
3. Table 2 has a result for ResNet-50 on CIFAR-10, but it is unclear what this architecture is, since it seems that 50-layer architectures were trained only on ImageNet in the cited paper.

5. Clarity: Is the paper well written?
Yes.

6. Relation to prior work: Is it clearly discussed how this work differs from previous contributions?
Some relationships with related work are only made clear in the supplementary material. In particular, the "Gated ResNet" variant of Highway nets seems rather closely related to the proposed technique, and this relationship should be made clear in the main paper (with a reference to details in the appendix, if needed). This is also the method that yields even faster convergence than the proposed technique, but it lags behind in final performance for unknown reasons.
A very closely related and important reference that is missing is Balduzzi et al. [A]. See Sec. 3.2 and 3.3, where beta scaling was considered as a theoretically grounded fix for whitened gradients. While it is unclear why ReZero also brings good generalization, it essentially improves residual networks by reducing gradient whitening, as shown by Balduzzi et al. Therefore, this reference should be clearly discussed and highlighted in this paper.
[A]: Balduzzi, David, et al. "The shattered gradients problem: If resnets are the answer, then what is the question?." arXiv preprint arXiv:1702.08591 (2017).

7. Reproducibility: Are there enough details to reproduce the major results of this work?
No

8. Additional feedback, comments, suggestions for improvement and questions for the authors:
I'm open to updating my rating.
To improve the paper, please address the questions on methodology and prior work. In particular, the specific initialization schemes under which the proposed method brings benefits compared to vanilla and Gated Residual networks should be stated very clearly. The relationship to Highway/Gated ResNets and the proposals of Balduzzi et al. should also be discussed clearly in the paper. Finally, if possible, please add results on additional datasets.

9. Please provide an "overall score" for this submission.
6: Marginally above the acceptance threshold.

10. Please provide a "confidence score" for your assessment of this submission.
5: You are absolutely certain about your assessment. You are very familiar with the related work.

11. Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work?
Yes

Reviewer #4 Questions

1. Summary and contributions: Briefly summarize the paper and its contributions.
The paper proposes ReZero, a simple architecture modification for residual networks. The authors introduce a residual connection for the input signal x and a trainable parameter $\alpha$ that modulates the non-trivial transformation of a layer $F(x)$. The advantages of this architecture include easy implementation, deeper learning (ReZero can successfully train networks with 10,000 layers), and faster convergence. To show the faster convergence, the authors conduct experiments on both CV and NLP tasks; for example, with $\alpha = 0$ at initialization it can converge 56% faster on enwik8.
Update: I have read the authors' rebuttal, the other reviews, and the messages during the discussion. I thank the authors for addressing my review. I have reduced my score in light of the authors' rebuttal and the reviewer discussion. My main concern is that, given the somewhat limited novelty, the experiments should be adequate. The full training curves should be provided; otherwise, the underlying reason this method works is not clear. I hope the authors decide to improve their analysis in a revised manuscript.

2. Strengths: Describe the strengths of the work. Typical criteria include: soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the NeurIPS community.
- The idea is interesting, the method is simple, and the implementation is easy.
- The authors conduct exhaustive experiments on both images and text, and the experimental results show that ReZero is sufficient to train deeper networks and can achieve faster convergence.
- The paper and source code will be useful for our community for training deeper networks.

3. Weaknesses: Explain the limitations of this work along the same axes as above.
- The proposed method is mainly applicable to residual networks.

4. Correctness: Are the claims and method correct? Is the empirical methodology correct?
Yes.

5. Clarity: Is the paper well written?
Yes, the writing is excellent.

6. Relation to prior work: Is it clearly discussed how this work differs from previous contributions?
Yes.

7. Reproducibility: Are there enough details to reproduce the major results of this work?
Yes

8. Additional feedback, comments, suggestions for improvement and questions for the authors:
- Line 267: can you provide any insight into why $\alpha = 1$ does not improve over the vanilla Transformer?

9. Please provide an "overall score" for this submission.
5: Marginally below the acceptance threshold.

10.
Please provide a "confidence score" for your assessment of this submission. 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. 11. Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work? Yes