Meta Review of Paper2413 by Area Chair z4UP
ACL ARR 2023 December Paper2413 Area Chair z4UP, 02 Feb 2024
ACL ARR 2023 December Paper2413 Meta Review
Readers: Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Authors, Paper2413 Reviewers Submitted, Program Chairs

Metareview: This work proposes a method to train a language model so that it can handle contexts longer than its pre-defined maximum length. The basic idea is to 1) stretch the original positional embeddings to a longer sequence by interpolating between the surrounding positions and 2) segment and subsample chunks while preserving their original positional embeddings. Experiments show better performance than other baselines on standard benchmarks.

Summary Of Reasons To Publish: The idea of interpolating surrounding positions to obtain the stretched positions appears novel. The training procedure of segmenting and concatenating chunks while preserving the original positional information is also interesting. Experiments show clear gains over out-of-the-box models with diverse attention mechanisms.

Summary Of Suggested Revisions: Most of the gains do not seem significant, but given the training efficiency, the results could be interpreted as comparable. Training speed should be reported to justify the benefits of the proposed approach. More analysis and discussion are needed to quantify the gains and to understand the impact of linear interpolation of surrounding embeddings. Our reviewers also report some clarity issues.

Overall Assessment: 4 = There are minor points that may be revised
Suggested Venues: ACL* conferences.
Best Paper Ae: No
Ethical Concerns: None.
Needs Ethics Review: No
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Great Reviews: Akid RcoQ

Official Review of Paper2413 by Reviewer RcoQ
ACL ARR 2023 December Paper2413 Reviewer RcoQ, 17 Jan 2024
ACL ARR 2023 December Paper2413 Official Review
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Paper Summary: This paper is about allowing language models to process sequences longer than those they were trained on. First, the authors propose an interpolation approach for extending absolute embeddings beyond their original length: the location of each embedding is stretched out, and interpolation between adjacent embeddings is used for the new positions between the original embeddings. They show that this method allows the model to maintain a similar perplexity when increasing length up to 5x the original length, extrapolating better than vanilla RoPE and roughly on par with ALiBi. Second, they propose finetuning with a subsample of tokens from a longer sequence while keeping the tokens' original position embeddings. They compare two variants of this approach, one where random chunks are kept from throughout the sequence and another where a large chunk near the end is kept along with some random tokens from earlier on. On a perplexity evaluation the method does better than a model with no finetuning but not quite as well as a model trained on the full sequence length (which requires more accelerator memory). An additional experiment shows that while a significant chunk of the improvement over out-of-the-box models comes simply from finetuning on in-domain data, there is still some further improvement beyond that due to their method.

Summary Of Strengths: Results are given across three different models with different attention types showing improvements over out-of-the-box models.
The absolute embedding interpolation seems like a nice trick, though people have done a similar thing before with RoPE (https://arxiv.org/pdf/2306.15595.pdf, which might be good to cite). Summary Of Weaknesses: The results improvements compared to the most fair baseline with domain adaptation don't seem that large: .1 perplexity improvement, and it is a bit hard to tell how significant that is. The results also do not reach the same level as full-sequence training, so I believe people would want to use that unless they can't due to memory constraints (the difference is about .1 perplexity, roughly the same as the difference between the proposed method and the domain adaptation baseline). Significant FLOPS or wall-time training speed savings could help boost the usefulness, but the speed isn't reported. Comments, Suggestions And Typos: What is the difference in training speed between your segmented method and full sequence training? If I understand correctly you are training on the same total number of tokens, so any improvement would be from halving the quadratic attention length. Since the models with domain adaptation as described in 5.4 are the most fair comparison against your model, I think those numbers should go in all the main results tables as the primary baseline. It might also be good to cite and ideally compare to https://aclanthology.org/2023.acl-long.816.pdf. What do you mean at line 102 by "sinusoidal embeddings are difficult to parallelize"? I'm not aware of any issues with parallelizing them. I'm not sure I understand what the equation at line 199 is saying or how you use it in your method or results. The results on ZeroSCROLLS in the appendix don't seem very helpful since they are so low. It would be great to get some downstream results where results are more reasonable, possibly by using few-shot or finetuning instead of zero-shot. Otherwise, it might be better to just remove it. 
In Figure 3, is the total amount of weight beyond the original length about the same before and after tuning, or are these values normalized? Also, why are the values lower at the far left of this plot, which I understand to be right after the cutoff? I would have expected them to be monotonically decreasing. It is a little hard to see exactly what the numbers are doing for APE in Figure 2. It could be good to zoom in on those a bit more (possibly dropping some of the very high RoPE points) or provide numbers in table.

Soundness: 3.5
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Rebuttal Part 2
ACL ARR 2023 December Paper2413 Authors, Petros Karypis (privately revealed to you), 29 Jan 2024
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: In Figure 3, is the total amount of weight beyond the original length about the same before and after tuning, or are these values normalized? Also, why are the values lower at the far left of this plot, which I understand to be right after the cutoff? I would have expected them to be monotonically decreasing.
Figure 3 is a histogram of the median attention weight attended to for positions past . The total amount of weight is the same before and after tuning, as the total weight each position attends to sums to 1. This plot shows the shift in the distribution of attention weights before and after length extension. These results suggest that after extension there is more variance in the distribution of attention weights, which could mean the model has learned to better leverage information in the context.

It is a little hard to see exactly what the numbers are doing for APE in Figure 2. It could be good to zoom in on those a bit more (possibly dropping some of the very high RoPE points) or provide numbers in table.

We will provide the exact numbers in the appendix for the final version; the relevant table is below.

(ppl.)   1x      2x       3x       4x       5x
APE      6.675   6.326    6.394    7.099    8.438
RoPE     6.677   17.348   45.797   69.288   NA
ALiBi    7.217   7.295    7.653    7.773    NA

Rebuttal Part 1
ACL ARR 2023 December Paper2413 Authors, Petros Karypis (privately revealed to you), 29 Jan 2024
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: The results improvements compared to the most fair baseline with domain adaptation don't seem that large: .1 perplexity improvement, and it is a bit hard to tell how significant that is. The results also do not reach the same level as full-sequence training, so I believe people would want to use that unless they can't due to memory constraints (the difference is about .1 perplexity, roughly the same as the difference between the proposed method and the domain adaptation baseline).

Regarding the perplexity gains of only , this is true for ALiBi ( for and ); however the gains for absolute positional embeddings (APE) for is .
The results on APE for indicate that the interpolation method we proposed works well for extrapolating to new lengths not that much longer than the original input size. What is the difference in training speed between your segmented method and full sequence training? If I understand correctly you are training on the same total number of tokens, so any improvement would be from halving the quadratic attention length. Correct, the compute savings when training on an equivalent number of tokens come from the training on shorter sequences. This impacts the quadratic scaling of attention, meaning training on segments requires less memory. In cases that use large distributed systems, training on shorter sequences also reduces the communication required for tensor parallelism and when pipelined parallelism is used, it reduces the time wasted due to pipeline bubbles. We did not report training timing because it is very implementation dependent and generally does not make for a good measure of efficiency [1]. For example, training optimizations that aim to reduce the quadratic memory scaling like flash attention are not compatible with ALiBi; interactions like these lead to very different speed-ups. In our case training on segmented sequences is on average faster than training on sequences twice as long on sequences four times as long. We will include this information in the revised manuscript. Since the models with domain adaptation as described in 5.4 are the most fair comparison against your model, I think those numbers should go in all the main results tables as the primary baseline. Thank you for the suggestion, we will move the domain adaptation results to the main tables. It might also be good to cite and ideally compare to https://aclanthology.org/2023.acl-long.816.pdf. Sun et al. [2] and their xPos positional method demonstrate strong extrapolation abilities. They do so by proposing a new positional embedding method to use alongside block-wise attention. 
In our work, we look at improving the extrapolation ability of existing positional embedding methods. We will include it as a reference in section 2.1; however, we do not believe this is a fair comparison as they tackle a slightly different problem.

What do you mean at line 102 by "sinusoidal embeddings are difficult to parallelize"? I'm not aware of any issues with parallelizing them.

That was a mistake on our part. One of the earliest proposed, non-sinusoidal, relative positional embedding methods [3] had issues with parallelization [4].

I'm not sure I understand what the equation at line 199 is saying or how you use it in your method or results.

The equation on line 199 is a formal definition of the criteria for a model to extrapolate properly. We decided to report perplexity in practice as it is a more standard metric and we found the two were correlated.

The results on ZeroSCROLLS in the appendix don't seem very helpful since they are so low. It would be great to get some downstream results where results are more reasonable, possibly by using few-shot or finetuning instead of zero-shot. Otherwise, it might be better to just remove it.

That is a good point, we will consider omitting these results from the final version.

[1] No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
[2] A Length-Extrapolatable Transformer
[3] Self-Attention with Relative Position Representations
[4] The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models

Official Review of Paper2413 by Reviewer Akid
ACL ARR 2023 December Paper2413 Reviewer Akid, 16 Jan 2024 (modified: 29 Jan 2024)
ACL ARR 2023 December Paper2413 Official Review
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Paper Summary: This paper proposes a training method to enable the length extrapolation capability of a Transformer language model.
The core contribution is that the proposed method does not incur additional training costs as the input sequence length is kept fixed. The authors achieved so by including segmented/skipped chunks that are far away from the current context. Experimental results show superior performance compared to the original Transformer language models. Summary Of Strengths: I think the idea is pretty neat: If the recency bias of RPE hinders a Transformer language model from attending to far away tokens, just include them in a skip-wise manner during the training process. The authors demonstrated empirically that such discontinued text chunks can still benefit the extrapolation capability of a model, which is quite surprising. Summary Of Weaknesses: There are two factors that lead to the performance gain in my opinion: The usage of distant content information that forces the model to focus on not only recent tokens but also distant ones if they contain useful information. The early exposure of extrapolated position indices. This ensures that a model has at least observed extrapolated position indices during the training time, and it won't be so confused, so to speak, during the extrapolation stage. I am not exactly sure which factor is the most critical one (or they are equally important?) from the experiments presented in the current submission. The second factor is very close to the method proposed in Ruoss et al, in which they used randomized position indices during the training process without increasing the training sequence length. I believe including this additional ablation study can help strengthen the contribution of this paper. That is, you only perturb the position indices but keep the content the same as the most 2,048 ones. If the performance worsens, then we know 1. also plays a crucial role. As a side note, this submission shares some merits with PoSE. 
I am aware that PoSE was posted within three months of the ARR submission deadline, so it's not mandatory for the authors to compare against it. I am including it just for your reference.

Ruoss et al: Randomized Positional Encodings Boost Length Generalization of Transformers
PoSE: https://openreview.net/forum?id=3Z1gxuAQrA

Comments, Suggestions And Typos: See above.
Soundness: 4 = Strong: This study provides sufficient support for all of its claims/arguments. Some extra experiments could be nice, but not essential.
Overall Assessment: 4 = This paper represents solid work, and is of significant interest for the (broad or narrow) sub-communities that might build on it.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns: None.
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Rebuttal
ACL ARR 2023 December Paper2413 Authors, Petros Karypis (privately revealed to you), 29 Jan 2024
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: We thank you for leaving insightful feedback and references. We compared our approach with the method described in Ruoss et al. [1]. We used the 125m parameter model with absolute positional embeddings (APE) and extended the model to and its original input length. We kept all other settings the same as described in section 5.2. We found our approach outperformed Ruoss et al.
suggesting both (1) learning to use distant information and (2) exposure to extrapolated indices are important to long sequence performance. The results are shown in the table below; OOTB ("out of the box") refers to the performance of the model with position interpolation (described in section 3.1) and no extra pretraining.

(ppl.) APE
OOTB           9.322   13.275
chunk          8.420   7.989
Ruoss et al.   9.018   11.534

We will include these results in the final version.

[1] Randomized Positional Encodings Boost Length Generalization of Transformers

Thank you for the new results
ACL ARR 2023 December Paper2413 Reviewer Akid, 29 Jan 2024
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: Dear authors, Thank you for the new results. This is very useful. Please include them in the final revision. I will raise the soundness score from 3.5 to 4.0. Thanks.

Official Review of Paper2413 by Reviewer dK4i
ACL ARR 2023 December Paper2413 Reviewer dK4i, 16 Jan 2024 (modified: 31 Jan 2024)
ACL ARR 2023 December Paper2413 Official Review
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Paper Summary: This paper describes a new method for positional embeddings. The authors propose an "interpolation-based approach which allows APE models to extrapolate to sequence lengths longer then they were trained on", and a segmented method of training which is shown to achieve competitive results with full sequence training.
--> Post discussion raised overall assessment from 1 to 2.5, and lowered self-confidence from 3 to 2.

Summary Of Strengths: The baseline comparisons are strong, against the main state-of-the-art positional embedding methods, RoPE and ALiBi. The method seems like it could be interesting based on Figure 2 and Table 3.
Summary Of Weaknesses: The paper seems hastily written and leaves confusion for the reader. For instance, The equation on 199 doesn't seem right to me. If k>Lt, then the equation on the left has less context than the equation on the right. The equation on the left should have a lower score. Both the equation on the left and right sees a context longer than what was seen in training. Lines 206 to 221 seem like they would belong more in the introduction or related work rather than the methods section. Line 230 to 239 might be the most important part of the paper introducing the method but I cannot tell if it is carefully/cleverly derived, or a heuristic. Why is there e_Lt=e_Lt-1 ? "We create short input sequences by sampling segments from the long sequences," - ok but how does this help in extrapolation to unseen sequence lengths? The results and analysis describe experiments on segmented training while the methods describe something else. I'm confused about Table 1 in relation to the rest of the results. I hope we are not looking at different sized models and architectures, and then saying that the perplexity are comparable along the axis of positional embeddings? 7.5 pages (excluding limitations), equations not properly numbered. Comments, Suggestions And Typos: I apologise for the terse review but the paper seems to me in early stages and perhaps was not quite ready for submission Soundness: 2.5 Overall Assessment: 2.5 Confidence: 2 = Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work. Best Paper: No Ethical Concerns: NA Needs Ethics Review: No Reproducibility: 2 = They would be hard pressed to reproduce the results: The contribution depends on data that are simply not available outside the author's institution or consortium and/or not enough details are provided. Datasets: 1 = No usable datasets submitted. 
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Rebuttal
ACL ARR 2023 December Paper2413 Authors, Petros Karypis (privately revealed to you), 26 Jan 2024
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: We thank you for leaving a thorough set of comments and feedback. We will improve the writing for the final version.

Q1: The equation on 199 doesn't seem right to me. If k>Lt, then the equation on the left has less context than the equation on the right. The equation on the left should have a lower score. Both the equation on the left and right sees a context longer than what was seen in training.

We apologize for any confusion our notation may have caused. The equation is correct according to our definition for on line 200. The term on the left-hand side of the equation has a longer context and should have a lower log-likelihood (score).

Q2: Lines 206 to 221 seem like they would belong more in the introduction or related work rather than the methods section.

We agree with the reviewer and we will move those sentences to the introduction in the final version.

Q3: Line 230 to 239 might be the most important part of the paper introducing the method but I cannot tell if it is carefully/cleverly derived, or a heuristic. Why is there ?

We considered two different ways for doing the interpolation when we increase the sequence length by a factor of . The first is to create positions between each successive pair of positional embeddings; e.g., for and an initial length of obtain [1, 1.5, 2, 2.5, 3], and the second is to generate them uniformly as follows: [1, 1.4, 1.8, 2.2, 2.6, 3]. The advantage of the first is that it includes all the original embeddings and we decided to go with that approach.
However, a drawback of this approach is that it leads to an extended sequence whose length is instead of . We decided to set the remaining embeddings to be the final embedding from the original matrix, . The expression is a typo, it should be .

Q4: "We create short input sequences by sampling segments from the long sequences," - ok but how does this help in extrapolation to unseen sequence lengths?

When extending the input context of a model, it must be trained to incorporate information from relative positional distances greater than those seen during training. By training on segments sampled from longer sequences we are able to expose the model to these out-of-distribution relative distances while remaining within a fixed context size. Our results indicate that training on the subsampled sequences is able to match up to 87% of the performance that would have been achieved by training on the full sequence.

Q5: The results and analysis describe experiments on segmented training while the methods describe something else.

In order to extend the input context length we take two steps. First, we extend the embedding matrix of models trained with absolute positional embeddings (APE). This step is required for APE models but not needed for RoPE and ALiBi. Second, we train on long sequences that have been subsampled to fit within the original input context length through our segmentation method. Section 3.1 describes the interpolation method and section 3.2 describes the segmentation strategies. Results in section 5.1 compare our interpolation method to the performance of other positional embedding methods without any training, which we referred to as "out of the box" performance. Sections 5.2-4 explore the performance of our segmentation method.

Q6: I'm confused about Table 1 in relation to the rest of the results. I hope we are not looking at different sized models and architectures, and then saying that the perplexity are comparable along the axis of positional embeddings?
Table 1 describes the key model characteristics (positional embedding (PE) method, size, input context) for the three different groups of models we perform experiments on. Results are never compared between different models but general trends across the different PE families are discussed. Tables 3, 4, 5 report perplexity for each positional embedding family for a single model size.

Q7: 7.5 pages (excluding limitations), equations not properly numbered.

Thank you for pointing out the equation numbering, we have fixed them in the final version. If the reviewer's concern is that the main narrative of the paper (excluding appendices) is short by about half a page to the target page limit, this will no longer be the case once we incorporate the changes needed to address the other reviewers’ suggestions such as a comparison with Ruoss et al [1].

[1] Randomized Positional Encodings Boost Length Generalization of Transformers

Response to Authors (2)
ACL ARR 2023 December Paper2413 Reviewer dK4i, 28 Jan 2024 (modified: 28 Jan 2024)
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: Q1: The equation on 199 doesn't seem right to me. If k>Lt, then the equation on the left has less context than the equation on the right. The equation on the left should have a lower score. Both the equation on the left and right sees a context longer than what was seen in training.

I'm sorry that I'm not getting it based on your equations - if , then the context window for will be smaller than . e.g., (LHS) has a smaller context window than (RHS).

Q3. Ok, thanks for the explanation. I don't fully understand the extended sequence, but I'm ok with how you handle the last values; I think it's not too critical.

Q4, 5, 6. Ok, thanks!

Q7.
My main concern was that this paper felt like it was in early stages and perhaps not quite ready for submission.

Q8. A sanity check on "To evaluate our methods we fine-tune three different classes of pretrained language models, one for each of the popular positional embedding methods" -- are you fine-tuning the entire model, the embedding matrix, or just the "new positional embeddings?"

Line 284: I think it's not right to say that perplexity is an "inverse log probability". The form of Perplexity you have used is the exponentiated average negative log likelihood of the sequence.

For now I will raise my score to 2. I think I understand the method better now thanks to the author's explanation. While training on segments is an interesting idea, it seems like a step back from linear interpolation approaches which do not require any further pre-training. This could just be a matter of personal taste, so feel free to enlighten me, but I suspect approaches that require further pretraining in order to do extrapolation are difficult to be adopted. Also please let me know about Q8. Thanks!

Response to further questions
ACL ARR 2023 December Paper2413 Authors, Petros Karypis (privately revealed to you), 28 Jan 2024 (modified: 29 Jan 2024)
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: Thank you for the engaging discussion.

Q1: I'm sorry that I'm not getting it based on your equation - if , then the context window for will be smaller than . e.g., (LHS) has a smaller context window than (RHS).

We defined as . If then, has more context. Per your example above whereas . Again, we acknowledge this notation is confusing (especially using " ") and will update it in the final version.

Q7: My main concern was that this paper felt like it was in early stages and perhaps not quite ready for submission.
Which aspect of the work do you believe is incomplete; the results or the writing? In this work we set out to develop a memory efficient method for extending the input context of pretrained models. To accomplish this we proposed an interpolation based method for extending absolute positional embeddings and a method to subsample long sequences to use for continual pretraining. We believe we have the necessary set of experiments to answer the research questions we set out to address. Are there any other additional experiments that would make this a more compelling work? In terms of the presentation, we plan to revise some of the notation to make it less confusing. Q8: A sanity check on "To evaluate our methods we fine-tune three different classes of pretrained language models, one for each of the popular positional embedding methods" -- are you fine-tuning the entire model, the embedding matrix, or just the "new positional embeddings?" In all experiments we fine-tune the whole model. This is standard practice when extending the input context size [1,2,3]. Line 284: I think it's not right to say that perplexity is an "inverse log probability". The form of Perplexity you have used is the exponentiated average negative log likelihood of the sequence. Thank you for pointing that out, we have corrected that in the final version. For now I will raise my score to 2. I think I understand the method better now thanks to the author's explanation. While training on segments is an interesting idea, it seems like a step back from linear interpolation approaches which do not require any further pre-training. This could just be a matter of personal taste, so feel free to enlighten me, but I suspect approaches that require further pretraining in order to do extrapolation are difficult to be adopted. Thank you for reconsidering your score. 
While we agree that interpolation-based approaches that would allow for length extrapolation without further pre-training are preferable, this is hard to achieve in practice. As reviewer Akid pointed out below, existing interpolation-based extension methods [1,2] require extra fine-tuning to really perform on long sequences. For example, NTK-RoPE [2] requires continual pretraining on sequences of the target length (8k, 16k, 32k). Fine-tuning on sequences of this length is expensive and sometimes infeasible for certain model size/sequence length pairs even with a batch size of 1. Fine-tuning on the segmented sequences provides a fixed cost for this necessary step of length extension.

[1] Extending Context Window of Large Language Models via Positional Interpolation
[2] NTK-RoPE
[3] GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length

Wonderful discussion!
ACL ARR 2023 December Paper2413 Reviewer Akid, 28 Jan 2024 (modified: 28 Jan 2024)
ACL ARR 2023 December Paper2413 Official Comment
Readers: Program Chairs, Paper2413 Senior Area Chairs, Paper2413 Area Chairs, Paper2413 Reviewers Submitted, Paper2413 Authors

Comment: Hello, nice discussion! Please allow me to share my two cents here with the reviewer and authors.

"While training on segments is an interesting idea, it seems like a step back from linear interpolation approaches which do not require any further pre-training."

To the best of my knowledge, most of the linear interpolation methods need to be fine-tuned (or further pre-trained if that's the name preferred). The only method that claims to be fine-tuning free is the NTK-RoPE method on language modeling tasks only (see Table B.7 of [1]). In my own experiments, NTK-RoPE does not perform well on even the synthetic passkey retrieval task without fine-tuning. Therefore, I believe "approaches that require further pretraining in order to do extrapolation are difficult to be adopted."
is not necessarily true as existing widely adopted approaches also require fine-tuning/further pretraining. I could be wrong, and I am also willing to hear what the authors have to say.

As a side note, the only work I am aware of that can (somewhat) do fine-tuning free length extrapolation is [2]. However, [2] relies on some temperature scaling methods and T5 positional embeddings (not RoPE and ALiBi experimented in this submission), so I don't think the authors need to compare their work against it.

[1] YaRN: Efficient Context Window Extension of Large Language Models, https://openreview.net/pdf?id=wHBfxhZu1u
[2] Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation, https://arxiv.org/abs/2311.00684

Supplementary Materials by Program Chairs
ACL ARR 2023 December Program Chairs, 16 Dec 2023
ACL ARR 2023 December Paper2413 Supplementary Materials
Readers: Program Chairs, Paper2413 Reviewers, Paper2413 Authors, Paper2413 Area Chairs, Paper2413 Senior Area Chairs

Responsible NLP Research: pdf
Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, this means none have been submitted.
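
Illustrative sketch for the Q3 exchange between Reviewer dK4i and the authors: the two interpolation layouts described in that rebuttal can be written out in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, and the factor of 2 and initial length of 3 are inferred from the two position lists quoted in the rebuttal, since the symbols themselves are missing from the thread. Padding with the final original embedding follows the authors' description of how the remaining rows are filled.

```python
def stretched_positions(orig_len, factor):
    """First layout: keep every original (1-based) position and place
    factor - 1 evenly spaced new positions between each neighbouring pair."""
    return [1 + i / factor for i in range(factor * (orig_len - 1) + 1)]


def uniform_positions(orig_len, factor):
    """Second layout: factor * orig_len positions spread uniformly
    over the range [1, orig_len]."""
    n = factor * orig_len
    step = (orig_len - 1) / (n - 1)
    return [1 + i * step for i in range(n)]


def interpolate_embeddings(emb, factor):
    """Linearly interpolate rows of `emb` (a list of equal-length vectors)
    using the first layout, then pad with copies of the last original row
    until the table has factor * len(emb) rows."""
    orig_len = len(emb)
    out = []
    for i in range(factor * (orig_len - 1) + 1):
        lo = i // factor                    # left neighbour (0-based)
        hi = min(lo + 1, orig_len - 1)      # right neighbour, clamped at the end
        w = (i % factor) / factor           # weight given to the right neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(emb[lo], emb[hi])])
    while len(out) < factor * orig_len:     # the first layout falls short of the target
        out.append(list(emb[-1]))
    return out


if __name__ == "__main__":
    print(stretched_positions(3, 2))
    print([round(p, 6) for p in uniform_positions(3, 2)])
    print(interpolate_embeddings([[0.0], [2.0], [4.0]], 2))
```

With a factor of 2 and an initial length of 3, the first function yields five positions and the second yields six, which matches the length shortfall the authors mention and explains why the last row is repeated.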