============================================================================
EMNLP 2020 Reviews for Submission #2444
============================================================================

Title: Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding

Authors: Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni and Julian McAuley

============================================================================
META-REVIEW
============================================================================

Comments: This work contributes a large-scale media dialog dataset named INTERVIEW from NPR radio programs, and performs large-scale analyses for response generation according to various aspects including dialog patterns, question types, and grounding documents. A good resource paper for our community.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
The paper focuses on the analysis of discourse patterns in media dialog and demonstrates the usefulness of external knowledge for grounding turn generation. Both discourse structure and grounding are very important for the task, and both are often overlooked. The authors bootstrap annotation of these news interviews for polarity, subjectivity, and combativeness on the interviewer side as a representation of a dialog act. The experiments demonstrate that auxiliary tasks for discourse help in modeling. The paper presents the topic in a very clear manner.

The main strengths of the paper are the presentation of the topic and an extensive evaluation. There are no weaknesses per se, except the annotations being silver.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
An interesting work that explicitly addresses the fact that dialogs are about something.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
None.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 4
Overall Recommendation: 4

Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
Line 368: "BERT embeddign vectors."

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
*Summary*

1. This work contributes a large-scale media dialog dataset named INTERVIEW from NPR radio programs, and performs large-scale analyses for response generation according to various aspects including dialog patterns, question types, and grounding documents. It also introduces two auxiliary losses to better generate a response in the media context, which can improve the generation quality of the GPT-2 decoder.

*Positives*

1. The key contribution of this work is the collection of INTERVIEW, a large-scale dataset of 105K media dialogues.

2. This work performs three types of discourse analysis on INTERVIEW.
First, it analyzes the dialog patterns in terms of response type triplets. Second, it investigates the question types in three interrogative aspects and runs a simple test for automatic dialog act classification. Finally, it also proposes a simple technique to link a dialog with facts in the grounding documents.

3. This work also explores how to improve knowledge-grounded response generation in media dialogue. It proposes two auxiliary losses, (i) look-ahead dialog structure prediction and (ii) question-attribute prediction, both of which help enhance the quality of utterance generation.

*Negatives*

1. The comparison with other media dialogue datasets is rather cursory. The RadioTalk dataset seems much larger in scale, so it could be a better benchmark for media dialogue analysis or NLP research. This work points out that RadioTalk comes from non-professional journalists, but I don't think that is a real problem; if anything, it could be a merit to study more casual dialogues.

2. In the analysis of knowledge grounding in section 4.3, measuring by perplexity alone could be rather inconclusive. It can tell which linking algorithm is better than the other, but it is hard to know exactly whether the best document is actually found for the discourse. In other words, we may be able to conclude that probabilistic linking is better than TF-IDF linking, but we cannot be sure whether probabilistic linking is really good.

3. The experimental results in section 6 are not very interesting. (1) One of the main results is that adding explicit grounding documents helps improve the quality of utterance generation, which sounds too obvious. (2) If I understand correctly, the two auxiliary losses require additional training labels. Thus, it is also fairly obvious that language generation improves with more labels.
(3) That is, it is less intriguing to confirm that response generation benefits from adding more external knowledge and more labels.

*Conclusion*

Generally, I'm positive toward this paper, but I have some doubt about the significance of the analyses. I will make my decision after the discussion of the authors' rebuttal.

*Post-rebuttal review*

My initial concerns were three-fold: (i) a lack of detailed comparison with other media dialogue datasets, (ii) unconvincing experiments on linking algorithms in section 4.3, and (iii) a somewhat tepid evaluation in section 6. The rebuttal is persuasive for (i), so this response should be included in the final draft. On the other hand, the responses for (ii) and (iii) do not fully resolve my concerns. Therefore, I'd like to keep my initial rating.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
Please see the positives in item #1.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
Please see the negatives in item #1.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 4
Overall Recommendation: 3.5

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper describes the Interview corpus, consisting of 105K conversations collected from news interview transcripts and annotated with dialog structure (question and answer patterns) and question type labels. Each conversation is also grounded to a specific document via similarity metrics computed between the transcription of a conversation and a set of news articles related to the corpus collection. The main claim of the paper is that these levels of annotation (question/answer patterns, question types, and document grounding) can be very useful for performing dialog analysis and can help in understanding modes of persuasion, entertainment, and information elicitation.

The constitution of the corpus and the different levels of annotation added to represent dialog structure and content are the main strengths of this paper. However, the second part of the paper is dedicated to dialog generation, the task being to predict the next turn of the host in a broadcast host/guest interview. It is not clear to me how this task is related to the goal of dialog analysis, and this constitutes, in my view, the main weakness of this paper.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------
The main benefit of this study to the NLP community is that it provides a new conversation corpus with discourse annotations, which differs from the task-oriented dialog and chit-chat dialog corpora used in most current studies on dialog modelling.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------
The main issue with this paper is the use of a dialog generation task to assess the relevance of the different levels of annotation defined in the Interview corpus.
Guessing the next turn by the guest in such dialogs seems to me a very difficult and artificial task, at which humans would fail as well as machines (guests can ask any kind of question or react to a previous statement in many different ways). It seems to me that the real task that would lead to dialog analysis is to go beyond discourse patterns and question types and reveal the internal structure of these conversations, for example by linking questions to answers and by segmenting a whole dialog into sub-dialogs. Going in this direction, it is the dialog generation task that could serve as an auxiliary task to those in charge of revealing such structures, and not the opposite.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Reproducibility: 4
Overall Recommendation: 3.5

Questions for the Author(s)
---------------------------------------------------------------------------
It is not clear to me what portion of the Interview 2P corpus has been manually annotated with discourse structure.

For question type classification, in Table 2 you obtain F-scores ranging from 70 for Combativeness to 80 for Polarity; however, in Table 3 you present an F-score of 90 for the same task. This is not clear to me: how comparable are these results? Does it mean that the dialog generation task helps predict accurate question types? This would be an interesting result.

In section 4.3 (Knowledge Grounding) you mention that the documents you are trying to link to each conversation are drawn from the same media, NPR, and consist of news articles from the past two decades. Couldn't you use the metadata attached to each broadcast show to perform this linking, or to evaluate your linking models?
A direct evaluation of the grounding strategies (what percentage of retrieved documents are indeed relevant to the targeted conversation) is needed.

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------
You mention in section 3.1 that "there is a paucity of media dialog datasets". You could mention corpora in languages other than English, for example in French: "The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News.", LREC 2010.
---------------------------------------------------------------------------