============================================================================
EMNLP-IJCNLP 2019 Reviews for Submission #1842
============================================================================

Title: Scalable and Accurate Dialogue State Tracking via Hierarchical Sequence Generation

Authors: Liliang Ren, Jianmo Ni and Julian McAuley

============================================================================
                                META-REVIEW
============================================================================

Comments: This paper formulates DST as a sequence generation task, which achieves better scalability with lower complexity. All reviewers agreed that the paper is well written and the formulation is interesting. Some reviewers were concerned about the experimental results because the authors did not show results for the proposed model without using BERT; the authors replied that they conducted these experiments and will include them in the revision.

============================================================================
                                REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------

This paper formulates the dialogue state tracking problem as a sequence generation task. The goal is to generate, item by item, each slot and its values conditioned on the system and user utterances. Compared with the conventional pairwise binary prediction formulation, the generation-style formulation is claimed to be scalable and to have O(1) inference time complexity. The proposed HSGN model substantially outperformed several strong competitors on the MultiWOZ 2.0 dataset.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

The problem formulation here is quite interesting, though a similar idea [1] has been applied directly to dialogue modelling. The best model substantially outperforms the baseline, setting a new state of the art on MultiWOZ. The paper is generally well written and easy to follow. The formulation is well motivated in the introduction.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

The comparison with the baselines might not be fair. The proposed HSGN model uses BERT_large as its feature encoder, but the competitors do not. The literature has shown that using BERT for pre-trained initialization can boost performance on several downstream tasks, so it is hard to tell whether the improvements come from the model structure or from BERT.

It would be more convincing to compare the runtime latency of HSGN against models using the pairwise prediction formulation. Including BERT_large in the runtime system might significantly increase the latency.

The inference time complexity (ITC) of HSGN is claimed to be O(1). But since it involves a generation process for slot values, the ITC might be O(V), where V is the vocabulary size. Compared with StateNet PSI, which has O(N) ITC, V is usually larger than N.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3

Questions for the Author(s)
---------------------------------------------------------------------------

Can you explain how CMR makes predictions over a dynamic vocabulary? What is the average number of slot-value pairs per state?
Can you compare the ITC of StateNet PSI (O(N)) and HSGN with a more rigorous derivation?

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------

[1] Lei, Wenqiang, et al. "Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

---------------------------------------------------------------------------

============================================================================
                                REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------

This paper formulates DST as a generation task to reduce the computational complexity, which otherwise increases in proportion to the number of pre-defined slots that need tracking. The DST generator applies a hierarchical encoder-decoder structure. The results show that the proposed model improves over the previous state of the art on the MultiWOZ dataset and achieves comparable performance on WOZ 2.0. The main issue of this paper is that results without the pre-trained embedding model BERT are not given, so I am not sure whether the improvement in performance comes from the model structure or from BERT.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

1) The paper is well written and readable.
2) This paper formulates DST as a generation task: the slots are generated directly instead of being given by the ontology, which reduces the computational complexity.
3) The proposed model applies a hierarchical encoder-decoder framework to first generate the slots and then generate the corresponding slot values.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

1) The results without the pre-trained embedding model BERT are not given, so I am not sure whether the improvement in performance comes from the model structure or from BERT.
2) The model with the pre-trained embedding model BERT does not substantially outperform previous models without BERT: the performance on the MultiWOZ dataset improves by only 3.57%. Since the performance of the baseline (without BERT) is very low, I do not find the improvement encouraging.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 2.5

Questions for the Author(s)
---------------------------------------------------------------------------

1) What is the performance of the proposed model without the pre-trained embedding model BERT?
2) When training/testing, what previous belief state do you use for predicting the next belief state: the gold dialogue state or the dialogue state predicted by the model? If the predicted dialogue state is used, the model may suffer from error propagation, i.e., if it makes a wrong prediction in some turn, it is hard to recover the true dialogue state in subsequent turns.

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------

Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems, ACL 2019.
---------------------------------------------------------------------------

============================================================================
                                REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------

This paper addresses the problem of dialogue state tracking in goal-oriented dialogue systems and proposes treating the task as a sequence generation problem (using a hierarchical encoder-decoder model with attention), in contrast to the standard approach of treating it as a pairwise prediction problem. The main strengths of the paper are 1) the novelty of the approach, 2) the clarity of the presentation, and 3) results that are competitive with state-of-the-art systems but obtained at a lower computational cost. The main weakness is that the performance is not demonstrably superior to other systems on the WoZ 2.0 data set.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

Interesting approach to dialogue state tracking. The methodology is described clearly, public data sets are used for the experiments, and the results are likely replicable. Useful comparison of computational complexity across different models that have been proposed in the literature. The ablation study helps to interpret how the model works.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

No improvement over state-of-the-art results on the WoZ 2.0 data set (however, the results were obtained at a lower computational cost, which is a benefit of the proposed approach even if the empirical results are similar).
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 4

Questions for the Author(s)
---------------------------------------------------------------------------
N/A

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------
N/A

---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
N/A

---------------------------------------------------------------------------