============================================================================
ACL-IJCNLP 2021 Reviews for Submission #3548
============================================================================

Title: Weakly Supervised Named Entity Tagging with Learnable Logical Rules
Authors: Jiacheng Li, Haibo Ding, Jingbo Shang, Julian McAuley and Zhe Feng

============================================================================
                                 META-REVIEW
============================================================================

Comments: While the application of the proposed method could be limited, it is good to have this type of paper that learns interpretable rules, something that is missing from most recent research.

============================================================================
                                 REVIEWER #1
============================================================================

The core review
---------------------------------------------------------------------------
This paper proposes TAILOR, a tagging framework for NER that can automatically generate compound logical rules from a small number of simple rules and provide pseudo labels without knowing the entity spans. The method addresses two problems: how to decide the entity boundaries and types, and how to obtain pseudo labels from the learned rules. Experimental results on three datasets show that TAILOR achieves good performance compared to other weakly supervised methods. An interesting aspect is that the learned rules are interpretable. Can these learned rules be debugged to generate better pseudo labels?

Reasons to Accept
---------------------------------------------------------------------------
The method is innovative, and the experimental results show its effectiveness. It can start from a couple of very simple seed rules (TokenStr) and automatically generate more precise compound rules, which are also interpretable. This approach can be very useful for NER datasets with few annotations. Though some parts are still not clear, the whole system is enlightening.

Reasons to Reject
---------------------------------------------------------------------------
It is still not clear how much TAILOR relies on a good POS tagger and dependency parser; it would be better to add some comparative experiments to explore this. It is also unclear how the set of rule candidates (new rules) is generated from the pseudo-labeled data and seed rules, and why the authors chose these five types of rules (see questions).
---------------------------------------------------------------------------

Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3.5

Questions for the Author(s)
---------------------------------------------------------------------------
It is still not clear to me how rule candidates are generated: by enumerating all possible rules extractable from the pseudo-labeled data? What is the size of the rule candidate set? What is the speed of the method? In Table 4, it seems there are only five types of rules. I think rules such as "Pre \and TokenStr" are also very important.
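To make this last question concrete, here is a minimal sketch of what I mean by a compound rule formed as a conjunction of atomic predicates. This is my own illustration: the Span type and the predicate names (token_str, pre_word, conjoin) are placeholders, not the authors' code.

    # Minimal sketch: compound logical rules as conjunctions of atomic
    # predicates over a candidate span. All names are illustrative.
    from collections import namedtuple

    # A candidate span: token indices [start, end) plus its surface text.
    Span = namedtuple("Span", ["start", "end", "text"])

    def token_str(surface):
        # Atomic TokenStr rule: the span's surface form equals `surface`.
        return lambda span, tokens: span.text.lower() == surface

    def pre_word(word):
        # Atomic Pre rule: the token right before the span is `word`.
        return lambda span, tokens: (span.start > 0
                                     and tokens[span.start - 1].lower() == word)

    def conjoin(*atomic_rules):
        # Compound rule: fires only when every atomic rule fires. Note
        # that this composes any number of conjuncts, not just two.
        return lambda span, tokens: all(r(span, tokens) for r in atomic_rules)

    # The compound rule Pre("in") AND TokenStr("paris"):
    tokens = ["i", "live", "in", "paris", "."]
    span = Span(start=3, end=4, text="paris")
    rule = conjoin(pre_word("in"), token_str("paris"))
    print(rule(span, tokens))  # -> True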
---------------------------------------------------------------------------

============================================================================
                                 REVIEWER #2
============================================================================

The core review
---------------------------------------------------------------------------
This paper proposes a method for automatically generating rules for detecting named entities, starting from a small set of manually written rules. All the rules are conjunctions of simple conditions (such as the token identity, POS, and dependency relation of a token and its neighbors), suitable for human examination. The method considers quality metrics for the bootstrapped rules. It deals with discovering interpretable rules, a type of approach that currently receives very little attention but that can still be important in some scenarios. The paper shows good experimental results, with high F1 scores on three datasets. I am not very familiar with this topic, but the contribution of the paper seems rather small.

Reasons to Accept
---------------------------------------------------------------------------
The paper describes well-executed experiments with relevant results, and shows an interesting use case of a rule-based approach.

Reasons to Reject
---------------------------------------------------------------------------
The contribution seems rather small.
---------------------------------------------------------------------------

Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3.5

Questions for the Author(s)
---------------------------------------------------------------------------
I would like to see examples of when noun phrases do not align with entities; if this is not possible in the introduction, where it is first mentioned, then at least in the appendix. Why is the category's global embedding computed from a sample only, while the local embeddings are computed with all entities in the high-precision set? Why choose the top 70% instead of some threshold? Setting a limit on the chunks may eliminate some good ones and include bad ones.
---------------------------------------------------------------------------

============================================================================
                                 REVIEWER #3
============================================================================

The core review
---------------------------------------------------------------------------
This paper presents a weakly supervised method to bootstrap a named entity tagging model from a small set of atomic rules. In particular, the proposed method iterates between proposing new logical rules, refining the rules, applying the rules to expand the training data, and training a neural tagger to detect named entities. The logical rules are constructed by combining, with conjunctions, simple rules about the content, context, POS tags, and dependency relations. To maintain good quality in the weakly labeled examples used to train the neural span tagger, the entities detected by the logical rules are compared against a threshold calculated from one held-out entity instance. In addition, the threshold is dynamically adjusted across iterations to balance reliability and exploration.
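For my own understanding, the iterative procedure described above can be summarized roughly as follows. This is my paraphrase in Python-flavored pseudocode: every function is a placeholder stub rather than the authors' API, and the direction of the threshold schedule is my assumption.

    # My paraphrase of the loop; all functions are placeholder stubs.

    def propose_rules(rules, corpus):      # enumerate candidate compound rules
        return []                          # stub

    def estimate_precision(rule, corpus):  # score a candidate on pseudo labels
        return 1.0                         # stub

    def apply_rules(rules, corpus):        # tag the corpus with current rules
        return []                          # stub

    def train_tagger(pseudo_labeled):      # fit the neural span tagger
        return None                        # stub

    def bootstrap(seed_rules, corpus, num_iters, tau_start=0.9, tau_step=0.05):
        rules, tagger = list(seed_rules), None
        for it in range(num_iters):
            # Relax the precision threshold over iterations, trading early
            # reliability for later exploration (the schedule is my guess).
            tau = tau_start - tau_step * it
            candidates = propose_rules(rules, corpus)
            rules += [r for r in candidates
                      if estimate_precision(r, corpus) >= tau]
            pseudo_labeled = apply_rules(rules, corpus)
            tagger = train_tagger(pseudo_labeled)
        return rules, tagger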
Empirical evaluations on three datasets show that the proposed method significantly improves accuracy and outperforms several existing methods.

In general, I think this is a solid paper. The description of the proposed method is detailed, the empirical evaluation is convincing, and the ablation experiments give further insight into the proposed method. One small concern is that the paper describes a complicated system, which might hinder its application in new domains.

Reasons to Accept
---------------------------------------------------------------------------
1. This work has good impact. Building an NER system for a new domain is challenging and time-consuming in many cases. The method proposed in this paper is generally applicable to most new domains.

2. The proposed method provides a set of logical rules that are more explainable, which makes it easier for humans to review them and identify potential issues.

Reasons to Reject
---------------------------------------------------------------------------
The proposed method is fairly complex. Even with the many details given in the paper, there might still be some difficulty in applying this method to new domains.
---------------------------------------------------------------------------

Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3.5

Questions for the Author(s)
---------------------------------------------------------------------------
1. It is not clear to me how the numbers for noun-phrase-based NER are measured. It would be good to give more details.

2. In Table 4, it seems the combinations of logical rules are limited to bigrams. Is that correct? Is there any value in combining more logical rules?

3. Line 340: why is E_s sampled from H_i instead of using all of H_i? Is this just for efficiency reasons?

4. Line 372: can you give more details about how \tau is set?

5. Table 2: what is the dictionary size for AutoNER?
---------------------------------------------------------------------------