Title: Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models
Authors: Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao
Status: Accept

Review #1

Paper Summary*
This paper presents a novel approach to Adaptive Retrieval-Augmented Generation (ARAG) for large language models (LLMs), termed Embedding-Informed ARAG (EI-ARAG). The authors argue that retrieval mechanisms may not always enhance the quality of responses, especially when LLMs already possess the necessary knowledge from pre-training. Instead of relying on pre-training data or additional model inferences to determine the necessity of retrieval, EI-ARAG leverages contextualized pre-trained token embeddings to evaluate the model's intrinsic knowledge. The paper hypothesizes that these embeddings can effectively indicate when external knowledge is required, and extensive experiments demonstrate the superior performance of this method across various benchmarks.

Summary of Strengths*
- Novel Methodology: The EI-ARAG approach offers a fresh perspective on adaptive retrieval by utilizing pre-trained embeddings, providing an efficient alternative to existing methods that require external data access or additional LLM calls.
- Empirical Results: The authors present extensive experimental evidence showing that EI-ARAG outperforms several state-of-the-art baselines in both accuracy and efficiency.

Summary of Weaknesses*
- Generalizability: While the paper demonstrates strong results on specific datasets (PopQA and TriviaQA), there is limited discussion of the generalizability of the approach to other domains or types of queries. This could impact the perceived robustness of the findings.
- Experimental Diversity: The experiments focus primarily on QA tasks. Including a broader range of NLP tasks could provide a more comprehensive evaluation of the proposed method's versatility.

Comments, Suggestions and Typos*
Questions:
- For Figure 1: why are the visualization results of the last layer similar to those of the 0th layer, resulting in a lack of clear patterns?
- Could you describe the details of how the "overall accuracy of EI-ARAG" is calculated in Figure 2?
- Has EI-ARAG been validated on larger models, such as 13B/72B models? Intuitively, the larger the model, the more knowledge it possesses, which suggests that the benefits gained from RAG would be smaller, right?
- Besides large language models, can the proposed EI-ARAG be applied to multimodal large language models?

Suggestions:
- It would be helpful to include more details on the methodology used to analyze the embeddings and their relationship to knowledge representation.

Typos:
- Line 128: iw -> is
- Line 132: on -> one
- Line 133: then -> than
- Line 567: detcet -> detect
- Format mismatch: "0th Layer" in Figure 1 vs. "0-th Layer" in Table 4

Soundness (1-5): 3
Overall Assessment (1-5): 3

Review #2

Paper Summary*
The paper describes a novel way of performing Adaptive Retrieval-Augmented Generation (ARAG). ARAG aims to improve question-answering performance in LLMs by retrieving external information when the LLM itself lacks the necessary knowledge. Previous methods determine when to retrieve additional information either from word frequencies in the original training corpus, which requires access to that corpus, or from additional LLM queries, which can be time- and resource-intensive. Instead, the authors propose using a classifier trained on word embeddings to determine whether retrieving additional information is necessary. The proposed method is faster, performs retrieval less often, and outperforms several existing retrieval methods.
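For concreteness, here is a minimal sketch of the retrieve-or-not decision as this reviewer understands it. The backbone checkpoint, layer index, pooling strategy, classifier shape, and labeling scheme are my assumptions; as noted under Weaknesses below, the paper does not specify them.

```python
# Reviewer's reconstruction, not the authors' code: pool contextualized
# embeddings from an early layer of a frozen LM and feed them to a small
# binary classifier that predicts "retrieve" vs. "answer directly".
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # model used in the paper; exact checkpoint assumed
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True)
encoder.eval()  # the backbone stays frozen; only the classifier is trained

def query_embedding(query: str, layer: int = 1) -> torch.Tensor:
    """Mean-pool token embeddings from one early layer for a single query."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)                 # (dim,)

# A small MLP over the pooled embedding; one logit for the retrieve decision.
# Training labels (not shown) would presumably mark whether the frozen LLM
# answers a training query correctly without retrieval.
classifier = nn.Sequential(
    nn.Linear(encoder.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

def should_retrieve(query: str, threshold: float = 0.5) -> bool:
    logit = classifier(query_embedding(query))
    return torch.sigmoid(logit).item() > threshold
```

Because the embedding comes from an early layer of a forward pass the system runs anyway, the decision adds essentially no extra inference cost, which matches the efficiency claims above.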
Summary of Strengths*
The proposed methodology seems novel, is easy to implement, and can be used to improve existing question-answering LLM systems. The paper is well-written and concise and presents the idea in an understandable manner. The evaluation is comprehensive, showing both that the model outperforms existing approaches and that word embeddings contain sufficient information for determining when to retrieve additional knowledge.

Summary of Weaknesses*
Chapter 2, describing the embedding-informed retrieval, is missing some specifics of the approach (e.g., exactly what and how much data was used to train the classifier, and what kind of classifier was used; the paper only specifies that data was sampled randomly from the training set and that the classifier was a neural network). This could make it difficult to replicate the results. Additionally, the computational cost of training this additional classifier is not included when comparing the computational costs of different approaches, even though it seems like it would add a significant cost. The description of some related work is somewhat unclear: in Chapters 1 and 2, the authors state that (Mallen et al., 2023) requires access to the pre-training dataset to compute entity frequency, while Appendix A states that retrieval is based on the popularity of entities on Wikipedia.

Comments, Suggestions and Typos*
The paper is well-written, but there are some spelling and grammatical errors that could be fixed, particularly the use of "knowledge" vs. "knowledgeable" (e.g., after line 126 and on line 149).

Soundness (1-5): 3
Overall Assessment (1-5): 3

Ethical Concerns
The authors address the need to verify the outputs of LLMs due to their tendency to generate incorrect or biased outputs.

Review #3

Paper Summary*
This work proposes a method to automatically decide when to rely on retrieval under the framework of retrieval-augmented generation (RAG) in large language models (LLMs). RAG is known to enhance the quality of an LLM's responses by injecting knowledge retrieved from an external knowledge base, but its performance may degrade if the LLM already has the knowledge relevant to the given query. This work adds a simple classifier that decides whether to retrieve or not using the output of the first layer of an LLM, i.e., LLaMA 2, which carries enough contextualized information about the query. Experiments show that this simple classification is effective in improving the quality of RAG without incurring extra inference latency.

Summary of Strengths*
The use of a classifier is well motivated and sounds reasonable to me. The use of the output from the first layer is also practical, considering both the cost of inference and the information in the contextualized embeddings of the input queries. The proposed method is simple yet effective, and its effectiveness is empirically demonstrated on two QA benchmarks.

Summary of Weaknesses*
It would be better to measure the impact of the classification accuracy on the end-to-end performance. At the very least, I'd expect a report of the classification performance of the proposed method.

Comments, Suggestions and Typos*
Research question 3 is not answered explicitly but only as part of the answer to research question 1. I'd expect a clearer answer.

Soundness (1-5): 4
Overall Assessment (1-5): 4