Blind Submission by November • Achieving Conversational Goals with Unsupervised Post-hoc Knowledge Injection
Meta Review of Paper1854 by Area Chair 6KeN
This paper proposes a new method for response generation in dialog systems in which responses are generated from a combination of a language model and a knowledge source, such as a database of text snippets (e.g. descriptions of restaurants).
Previous work on incorporating knowledge into response generation systems has required the knowledge source to be available at training time. This can make it challenging to update the knowledge source and limits agility.
In contrast, the method proposed here allows arbitrary knowledge to be incorporated in a response by applying the integration at decoding time, without any retraining of the conversation model. This involves a novel application of different decoding strategies to balance exploration of the output space and integration of knowledge as constraints.
Experiments on multiple dialog tasks using different kinds of knowledge sources (including snippets decoded from a GPT language model) show that the method performs very well compared to alternative methods, on both automatic and human evaluations.
The method is novel and has many attractive properties: as well as scoring highly on response quality evaluation, it introduces the ability to add or update knowledge sources on the fly, which was not possible with previous approaches in this area. The reviewers all comment that the ideas are strong.
The evaluation is comprehensive and convincing.
Reviewers BJEf and DUsa have identified a number of points where the paper is unclear or leaves questions unanswered. Addressing these points will improve the readability and impact of the paper.
Reviewer DUsa has also identified some possible gaps/confounds in the human evaluation, e.g. that there may not have been a control for the effect of sentence length. It would strengthen the evaluation if these could be addressed, either by analysis of the data that was already collected or by additional experiments.
All *ACL venues would be suitable.
Official Review of Paper1854 by Reviewer ZhvD
This paper proposes a novel system for knowledge-grounded dialogue response generation. Specifically, the authors initialize a response from an existing dialogue model, generate and select several knowledge snippets from both parametric and non-parametric knowledge sources, decode with the injected knowledge, and rank the candidate responses. This procedure allows the system to generate more diverse and creative responses without losing fluency, as verified through solid automatic and human evaluations.
This system is compatible with existing dialogue models and flexible across multiple scenarios. To absorb newer knowledge, the only thing to do is fine-tune the PTMs or update the Yelp reviews.
The writing is clear and well-reasoned, and the experiments are solid and diverse, including both automatic and human evaluation.
The approach of sub-selecting relevant but diverse knowledge leads to responses that also promote success in achieving conversational goals.
My main concern with the paper is its lack of evaluation against some advanced baselines for knowledge-grounded dialogue generation, and its lack of references, especially on Wizard-of-Wikipedia:
- Zhao et al., Knowledge-Grounded Dialogue Generation with Pre-trained Language Models. In EMNLP, 2020.
- De Bruyn et al., BART for Knowledge Grounded Conversations. In Converse@KDD, 2020.
- Kim et al., Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. In ICLR, 2020.
- Shrimai Prabhumoye et al., Focused Attention Improves Document-Grounded Generation. In NAACL, 2021.
I am curious whether the knowledge selection performance of the proposed method can be compared with existing models, especially the accuracy of knowledge selection (e.g., Table 1 in the SKT paper of Kim et al., ICLR 2020)
Please refer to the "minor" above.
Official Review of Paper1854 by Reviewer MxzE
The manuscript proposes a new approach of injecting knowledge to alleviate the lack of specificity and informativeness that existing dialog models suffer from. This paper proposes POKI (Post-hoc Knowledge Injection), a technique that generates more informative and engaging responses based on retrieved knowledge in dialogue systems without additional training. Experiments on task-oriented dialogue (MultiWOZ) and knowledge-grounded dialogue (WoW) benchmark datasets show that the proposed model can generate more engaging and informative responses.
[S1] This paper is well-motivated and proposes a new idea of generating informative response in an unsupervised knowledge injecting manner based on retrieved knowledge.
[S2] Experimental results on two benchmark datasets (MultiWoZ, WoW) are impressive.
[S3] The analysis is comprehensive to verify the proposed methods.
[W1] Some of the components have been adopted from existing work (e.g., knowledge retrieval, DPP). Only a shallow analysis is conducted for knowledge selection.
[C1] I like the detailed and comprehensive experiment setups. It would be nice to add more case studies including error cases.
Official Review of Paper1854 by Reviewer BJEf
This paper aims at increasing the diversity of the generated utterances by injecting related knowledge to the model generated responses. The knowledge injection happens only in the response inference process and requires no extra training, which leads to the major advantage of the proposed method, i.e. the flexibility in adding newer knowledge sources without model retraining.
The proposed method first collects a set of relevant knowledge snippets conditioned on both the dialogue history and an initial response generated using an existing dialogue model. The knowledge snippets are collected by (1) extracting from pretrained language models, i.e. the parametric knowledge sources, using prompts, and (2) retrieving from a text corpus, i.e. the non-parametric knowledge sources. Then the proposed method constructs multiple candidate responses by individually injecting each knowledge snippet into the initial response using a gradient-based decoding method. Finally, the candidates are ranked based on their diversity and coherence with the dialogue history.
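The pipeline summarized above maps onto a small end-to-end skeleton. In the runnable sketch below, every component is a deliberately trivial stand-in (all names, data, and scoring logic are assumptions for illustration only), with naive concatenation in place of the paper's gradient-based injection:

```python
# Toy, runnable sketch of the described pipeline: draft a response, gather
# snippets, sub-select a diverse subset, inject, then re-rank.  Every
# component is a trivial stand-in, not the paper's actual model.

def draft_response(history):
    return "There are many restaurants in the centre."

def gather_snippets(history, draft):
    # Stand-ins for (1) PTLM prompting and (2) corpus retrieval.
    return ["The Golden Wok serves Chinese food.",
            "The Golden Wok serves Chinese cuisine.",   # near-duplicate
            "Parking nearby is free after 6pm."]

def select_diverse(snippets):
    # Greedy stand-in for the paper's DPP step: skip near-duplicates
    # based on word overlap with already-chosen snippets.
    chosen = []
    for s in snippets:
        if all(len(set(s.lower().split()) & set(c.lower().split()))
               / len(set(s.split())) < 0.6 for c in chosen):
            chosen.append(s)
    return chosen

def inject(draft, snippet):
    # Stand-in for gradient-based injection: naive concatenation,
    # i.e. essentially the "KCopy"-style baseline.
    return draft + " " + snippet

def rank(candidates):
    # Stand-in coherence score: prefer the longer candidate.
    return max(candidates, key=len)

history = "User: Can you suggest somewhere to eat?"
d = draft_response(history)
selected = select_diverse(gather_snippets(history, d))
best = rank([inject(d, k) for k in selected])
```

The near-duplicate snippet is dropped at the selection step, and the final response is the draft with one selected snippet appended.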
(1) The paper is well written.
(2) The idea of injecting related knowledge to the model generated responses without any extra training is very interesting.
(3) The authors conducted a large amount of automatic and human evaluation and analysed the model’s performance sufficiently.
(4) The experiment results and showcases demonstrated the effectiveness of the proposed method.
(1) Focusing on increasing the response diversity of goal-oriented dialogue generation does not sound reasonable. The aim of goal-oriented dialogue generation is helping the users achieve their goal efficiently. Maybe chit-chat response generation is a better testbed for the approach.
(2) The reason for, and influence of, introducing the parametric knowledge sources is unclear. It seems that the generated knowledge snippets suffer from a low-factuality problem (Table 5). Why did the authors introduce this knowledge source? Even though, with the help of DPP, the factuality of the knowledge snippets is increased, would it be better to use only the non-parametric knowledge sources? I suggest the authors conduct experiments to analyse the influence of each knowledge source. I'm also questioning the effects of DPP. It seems that DPP only considers the redundancy and the relevance of each snippet. How could it improve the factuality of the generated knowledge snippets (middle column of Table 5)?
(3) Using GPT-2 as LM for filtering. GPT-2 is pre-trained using WebText. The language features of WebText are very different from dialogues. Maybe it would be better to use models pre-trained on dialogue-related corpora, e.g. PLATO (https://aclanthology.org/2020.acl-main.9.pdf).
(4) Important details are missing/not clearly demonstrated:
(a) [Section 2.1] It is not clear how to choose which key-phrase in the context to focus on for prompt generation;
(b) [Section 2.2] Which text corpus does each task use as non-parametric knowledge sources? It is understandable to use Yelp as the external knowledge for goal-oriented dialogue generation (MultiWOZ). However, it is not very reasonable to use Yelp as the external knowledge for knowledge-grounded dialogue generation (WoW). Also, it is not very clear whether the authors used the associated knowledge presented in WoW as the knowledge source. I suggest the authors clearly explain this in Section 4.1;
(c) [Section 2.2] The authors did not explain how to retrieve the relevant knowledge instances from the text corpus (Line 160);
(d) [Section 2] The authors did not offer statistical analysis for the retrieved snippets, e.g. on average how many snippets are generated for each prompt, how many snippets are retrieved from the text corpus, how many snippets are left after the knowledge selection (Section 3.1);
(e) [Section 3] It is not clear how the redundancy score is kept symmetric (Line 201). As far as I understand, PMI(k_i, k_j) is not equal to PMI(k_j, k_i);
(f) [Section 3.2] The structure of the entailment classifier is not clear. Also, I am very confused about the fine-tuning process of the entailment classifier. It seems that the inputs of the classifier during the inference process are the hidden representation (output of the response decoder) of the initial response x^f and the tokens of the dialogue history. However, if the authors use normal tokens for the DNLI fine-tuning, the input of the classifier is just token embeddings, which is different from the hidden representation coming out of the response decoder. How can the authors guarantee the performance of the entailment classifier? The authors also did not report the performance of the classifier.
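On point (e) above: one standard way to obtain a symmetric redundancy score, assuming the paper estimates a directional (conditional) variant of PMI, is to average the two directions; PMI computed from the joint distribution is symmetric by definition. A minimal numeric sketch with made-up counts:

```python
import math

# Toy co-occurrence counts between two snippets' key terms; all numbers
# are illustrative assumptions, not from the paper.
n_total = 1000          # documents
n_i, n_j = 60, 40       # docs containing term i, term j
n_ij = 30               # docs containing both

p_i, p_j, p_ij = n_i / n_total, n_j / n_total, n_ij / n_total

# PMI from the joint is symmetric by definition:
pmi = math.log(p_ij / (p_i * p_j))

# A directional (conditional) variant, log p(j|i), is NOT symmetric when
# p(i) != p(j); averaging the two directions restores symmetry:
fwd = math.log(p_ij / p_i)     # log p(j | i)
bwd = math.log(p_ij / p_j)     # log p(i | j)
sym = 0.5 * (fwd + bwd)
```

Here `fwd` and `bwd` differ (log 0.5 vs. log 0.75), while `sym` is invariant under swapping i and j.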
Official Review of Paper1854 by Reviewer DUsa
The authors present a method to "inject" knowledge retrieved from some corpus (e.g., Wikipedia) or generated from a PTLM (e.g., GPT-3) into responses from a conversational agent in both task-oriented and knowledge-grounded conversations. The knowledge injection involves a combination of retrieving appropriate knowledge from different sources, ranking it (using a novel approach), and evaluating its appropriateness in context (via entailment) and as natural text. Finally, they produce multiple outputs and rank these to select the best. They present a detailed comparison of several aspects of the system in both contexts, considering both automatic metrics and human judgements. They report that their system, POKI, is a significant and large improvement over several recent baselines. Overall the work is strong, but could be made much better with some hopefully minor improvements. This work has the potential to become very influential, and I would encourage the authors to make those improvements.
The method proposed is a novel combination of ideas, some of which are new to me. The strategy for knowledge injection in particular should be useful for many others interested in controllable NLG and similar knowledge injection problems. The strategy for ranking knowledge extracts (DPP) should be quite useful to others as well; while it isn't new, it is a useful application, and many readers won't be familiar with it.
The paper is clearly written and with only a few minor exceptions I found it both informative and easy to read. It was easy to see the progression of ideas and how they developed from previous related work.
The authors evaluate their work thoroughly in two important contexts: knowledge grounded and task oriented conversations, so it is clear it is applicable to major problems in conversational AI.
The approach described avoids expensive re-training of language and seq2seq models, instead injecting retrieved knowledge at runtime into a generated response.
- The main weaknesses I see relate to the human evaluation of the work. I am concerned that it is possibly too easy to tell the source of the two responses that are compared in their pairwise task (Figure 4 in the appendices). I would like to see much more detail about the human evaluation:
a. Two annotators is not really enough. Three is a minimum.
b. Did you randomize the order in which the generated responses were shown? Or try to estimate the positional bias?
c. The human evaluation should have compared against the KCopy baseline as well. I noted that the examples in Figure 3 are not too different from simply appending the knowledge snippet to the original generated text. For instance, in the second turn, the final response includes verbatim text from the selected knowledge. It isn't clear that the extra step of forward-backward incorporation of the knowledge into the generated text is actually necessary.
d. Again, in the example in Figure 4, it is very easy to tell which example response is which. Are you sure that annotators weren't just picking the longer of the two examples? You should evaluate whether there are (much) simpler ways of predicting which response the annotators would choose. The numbers in Table 3 are so lopsided that I wonder if the comparison is fair. (Note, this isn't necessarily bad: the POKI example given is clearly much better than the baseline it is compared to, but they are also very different in length. I wonder if the authors could show that POKI produces better quality responses even when the two have similar length.)
- It isn't clear exactly how the forward-backward procedure works, when the original output x^d and the knowledge snippet have different lengths. How do you handle the stopping criteria? More detail is needed here.
L157: you write "generated knowledge" but do you mean "generated text"? PTLMs don't generate knowledge per se, they output text which represents knowledge they have assimilated during pre-training.
L200: technically this equation is not symmetric, and you say you make it symmetric. I can imagine how this is done, but you should state it explicitly.
L217-221: While I figured out what you're doing eventually, that requires me to understand properties of determinants that perhaps not everyone has at their fingertips. This section, especially this sentence, needs to be more clear. It's worth doing, as DPP is a powerful idea, but requires some sophistication to understand easily.
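The determinant property behind lines 217-221 can be shown in a few lines: under a DPP, the score of a subset is the determinant of the corresponding kernel submatrix, which is large when items are individually relevant and mutually dissimilar. A minimal sketch with a made-up kernel (all values are assumptions):

```python
import numpy as np

# Toy DPP intuition: the determinant of a kernel submatrix rewards
# subsets of relevant, mutually dissimilar items.  L[i, j] is quality-
# weighted similarity, with made-up values (assumptions).
quality = np.array([1.0, 1.0, 1.0])
sim = np.array([[1.0, 0.95, 0.1],    # items 0 and 1 are near-duplicates
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
L = np.outer(quality, quality) * sim

def dpp_score(subset):
    # Determinant of the submatrix indexed by the subset.
    return np.linalg.det(L[np.ix_(subset, subset)])

redundant = dpp_score([0, 1])   # near-duplicate pair -> det near 0
diverse = dpp_score([0, 2])     # dissimilar pair     -> det near 1
```

Geometrically, the determinant is the squared volume spanned by the items' feature vectors, so near-duplicates (almost-parallel vectors) get a score near zero.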
L274-77: this is where I first wondered about the difference in length between x^d (or z) and k. I can imagine how this is handled, but I need to see more details here.
L302: you could be more explicit what the forward pass objective is. Is it towards the original target text? If so, does that push the output away from the knowledge k? Or is the goal simply towards, say, a better score from the GPT LM?
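The shape of objective being asked about here can be illustrated with a toy version of gradient-based ("plug-and-play"-style) decoding; the exact objective below is an assumption for illustration, not the paper's. Soft per-position logits are optimized to stay anchored near the draft (a stand-in fluency pull, via a quadratic penalty) while being attracted toward knowledge-word ids:

```python
import numpy as np

# Toy gradient-based decoding sketch; all values and the objective
# itself are illustrative assumptions, not the paper's formulation.
rng = np.random.default_rng(0)
vocab, length = 10, 4
draft_logits = rng.normal(size=(length, vocab))
know_tokens = [3, 7]                      # made-up knowledge-word ids
lam, lr = 1.0, 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

mask = np.zeros(vocab)
mask[know_tokens] = 1.0

y = draft_logits.copy()
for _ in range(300):
    p = softmax(y)
    s = p[:, know_tokens].sum(-1, keepdims=True)   # knowledge mass per position
    grad_know = p * mask / s - p                   # gradient of log(s) wrt logits
    grad = lam * grad_know - (y - draft_logits)    # attraction + pull toward draft
    y += lr * grad                                 # gradient ascent step
```

After the loop, the knowledge-token mass under `softmax(y)` should be higher than under the draft logits, while `y` stays close to the draft; the reviewer's question is which of these two pulls the paper's forward pass actually implements.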
Section 3.3: could you have used DPP here as well? It seems like perhaps you could. Why not do that?
Table 1: The authors need to provide more details of the statistical analysis. For instance, how did they take into account multiple comparisons? What comparisons were actually done? Also, p < 0.05 is a very loose criterion; p < 0.01 really should be the minimum. Have any of the comparisons met that test? (Honestly, I prefer that authors simply state what the value of p was.)
L398: why did you choose BART, and not something else, as the baseline for WoW?
Figure 3: Is this conversation from MultiWOZ? If so, please include the ground truth "gold" responses.
Table 4: This is a very small study (if I read correctly, 15 examples per row of this table). Are the results statistically significant?
Table 4 caption: "an additional knowledge (Know) that was not explicitly asked helped" is not grammatically correct. Should read something like "user was helped by some additional knowledge (Know) that was not explicitly asked for"
Line 543: "For the knowledge-grounded setting (WoW), both BART and POKI are knowledge intensive." What does this mean? I don't understand what you're getting at.
Table 5: Why would factuality depend on the retrieval method? This bothers me, and I'd like to see some explanation. Either the information retrieved is factual or it isn't.
Line 572: if word-overlap retrieval is so limited (and this implies that the authors' re-ranking steps don't compensate for this), why not use something better? Or retrieve more examples before re-ranking?
Line 589: 3ms/token will add up, so I don't think this difference is insignificant, especially given that the responses you show are fairly long, so this could add up to half a second or more pretty easily.
This paper has a really novel combination of ideas, and while those ideas aren't exactly new, it is an exceptionally creative application and combination of them. The paper is for the most part very clearly written, and it is clear how one could repeat the work. The approach to knowledge injection is creative and economical. I think it will be exceptionally influential, and I am itching to share it with my colleagues.
I don't see any ethical issues with the work.