Anonymous
16 Nov 2021 (modified: 14 Jan 2022) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: High-quality phrase representations are essential to finding topics and related terms in documents (a.k.a. topic mining). Existing phrase representation learning methods either simply combine unigram representations in a context-free manner or rely on extensive annotations to learn context-aware knowledge. In this paper, we propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to the pretraining is positive pair construction from our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and thereby further improves phrase representations for topics. UCTopic outperforms the state-of-the-art phrase representation model by 38.2% NMI on average across four entity clustering tasks. Comprehensive evaluation on topic mining shows that UCTopic can extract coherent and diverse topical phrases.
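To make the pretraining objective above concrete, here is a minimal sketch of a contrastive setup in which two contexts of the same phrase mention form a positive pair scored against in-batch negatives (the negatives that CCL later replaces). The `encoder` function, batch layout, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pretrain_infonce(encoder, ctx_a, ctx_b, tau=0.05):
    """Pretraining sketch: ctx_a[i] and ctx_b[i] are two context
    windows containing the same phrase mention, so row i forms a
    positive pair; every other row in the batch serves as an
    in-batch negative. `encoder` is an assumed phrase encoder
    that maps a batch of contexts to (B, d) vectors."""
    za = F.normalize(encoder(ctx_a), dim=-1)   # (B, d)
    zb = F.normalize(encoder(ctx_b), dim=-1)   # (B, d)
    logits = za @ zb.t() / tau                 # (B, B) pairwise similarities
    targets = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are positives
```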
Revealed to Jiacheng Li, Jingbo Shang, Julian McAuley

14 Nov 2021 (modified: 15 Nov 2021) · ACL ARR 2021 November Submission
Software:  zip
Data:  zip
Preprint: yes
Preferred Venue: ACL 2022
Consent: yes
Consent To Review: yes


Supplementary Materials by Program Chairs

ACL ARR 2021 November Program Chairs
14 Jan 2022 · ACL ARR 2021 November Paper620 Supplementary Materials · Readers: Program Chairs, Paper620 Reviewers, Paper620 Authors, Paper620 Area Chairs
Software:  zip
Data:  zip
Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, none were submitted.

Meta Review of Paper620 by Area Chair WTjX

ACL ARR 2021 November Paper620 Area Chair WTjX
06 Jan 2022 · ACL ARR 2021 November Paper620 Meta Review · Readers: Paper620 Senior Area Chairs, Paper620 Area Chairs, Paper620 Authors, Paper620 Reviewers, Program Chairs
Metareview:

The paper proposes a method to pre-train a phrase encoder using the intuition that phrases of the same entity should have similar encodings and phrases of different entities should have different encodings. Furthermore, a clustering approach, cluster-assisted contrastive learning, is proposed to reduce noisy in-batch negatives and thereby improve the negative samples. Evaluation of the encoder on entity clustering and topical phrase mining shows the effectiveness of the proposed approach compared to other methods.

Summary Of Reasons To Publish:

The paper includes extensive experiments on the downstream tasks, where the proposed model significantly outperforms other methods. In addition, a detailed discussion is provided for different datasets to further analyse the effectiveness of the proposed method with respect to informativeness, diversity of the constructed phrases, and the source of phrase semantics. The method used for topic identification, based on clustering phrase encodings, works well, especially when the encoder is further fine-tuned on the task (i.e., turning the original fine-tuning process into a topic-specific one).

Summary Of Suggested Revisions:

The explanation in the figure of the two assumptions (that phrase semantics are determined by their context, and that phrases with the same mention have the same semantics) is not very clear. For the examples in Figure 1, the semantics of the phrase “United States” are fixed and not influenced by the context. The authors could provide a more detailed description of how the pre-training phrases are constructed, how many of them there are, and how these phrases overlap with the downstream tasks.

Overall Assessment: 4 = There are minor points that may be revised

Official Review of Paper620 by Reviewer Km9g

ACL ARR 2021 November Paper620 Reviewer Km9g
27 Dec 2021 (modified: 27 Dec 2021) · ACL ARR 2021 November Paper620 Official Review · Readers: Program Chairs, Paper620 Senior Area Chairs, Paper620 Area Chairs, Paper620 Reviewers, Paper620 Authors
Paper Summary:

The paper proposes a contrastive learning-based method for phrase representations and topic mining. Results show that the method achieves the best performance on topic mining and phrase representations. The proposed model can extract more diverse phrases.

Summary Of Strengths:
  1. The paper applies unsupervised contrastive learning to topic modeling, which is well suited to this unsupervised task. Results show that the proposed model can extract more diverse phrases.
  2. The authors also find that in the finetuning process, in-batch negative samples hurt performance, so they propose a cluster-assisted contrastive learning method to reduce noise and turn the original finetuning process into a topic-specific finetuning process.
  3. Experimental results show that the proposed model achieves good performance on several datasets.
Summary Of Weaknesses:
  1. It seems that the biggest contribution of the paper is applying contrastive learning to topic modeling, which is of limited novelty.
  2. In-batch sampling is itself a sampling method, so what is the major difference between it and the proposed one? This choice limits the performance significantly.
Comments, Suggestions And Typos:

Suggestions:

  1. The assumptions in Section 1 are a little confusing. First, consider “The phrase semantics are determined by their context.” For the examples in Figure 1, the semantics of the phrase “United States” are fixed and not influenced by the context. I think the authors want to express: if we mask “United States”, we can still infer the masked phrase from its context.
  2. Writing should be strengthened.
Overall Assessment: 3.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper620 by Reviewer kjKn

ACL ARR 2021 November Paper620 Reviewer kjKn
27 Dec 2021 (modified: 27 Dec 2021) · ACL ARR 2021 November Paper620 Official Review · Readers: Program Chairs, Paper620 Senior Area Chairs, Paper620 Area Chairs, Paper620 Reviewers, Paper620 Authors
Paper Summary:

This paper proposes a contrastive learning framework to learn phrase representations in an unsupervised way. Cluster-assisted contrastive learning is proposed to reduce noisy in-batch negatives by selecting negatives from clusters. Extensive experiments on entity clustering and topical phrase mining show the effectiveness of the proposed methods. Case studies are also provided that demonstrate coherent and diverse topical phrases can be found by UCTopic without supervision.

Summary Of Strengths:
  • The paper is well-written and clearly presented.
  • The proposed cluster-assisted contrastive learning objective is well-motivated and effective when further finetuning the encoder on the target task. Extensive experimental results are provided to show the significance of the proposed UCTopic versus baseline methods.
  • Detailed discussion is also provided for different datasets to further analyze the effectiveness of the proposed method with respect to informativeness, diversity of the constructed phrases, and the source of phrase semantics.
Summary Of Weaknesses:
  • The details of how K-Means is applied to obtain the pseudo labels in CCL, and of how the number of clusters affects the final performance, are missing (see the sketch below for one plausible reading of this step).
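For concreteness, here is a minimal sketch of one plausible reading of this step: K-Means over phrase embeddings yields pseudo topic labels, and each anchor draws its positive from its own cluster and its negatives only from other clusters. The function name, `n_topics`, `n_neg`, and the sampling details are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def ccl_loss(phrase_emb, n_topics, n_neg=16, tau=0.05):
    """CCL sketch: cluster phrase embeddings with K-Means to get
    pseudo topic labels, then contrast each anchor against one
    same-cluster positive and n_neg other-cluster negatives, so
    same-topic phrases are no longer treated as negatives.
    n_topics (the cluster count asked about above) and n_neg are
    assumed hyperparameters."""
    emb = F.normalize(phrase_emb, dim=-1)
    labels = torch.as_tensor(
        KMeans(n_clusters=n_topics, n_init=10)
        .fit_predict(emb.detach().cpu().numpy()),
        device=emb.device,
    )
    losses = []
    for i in range(emb.size(0)):
        same = (labels == labels[i]).nonzero().squeeze(1)
        pos = same[same != i]                            # same-cluster positives
        diff = (labels != labels[i]).nonzero().squeeze(1)
        if pos.numel() == 0 or diff.numel() < n_neg:
            continue                                     # no usable pairs for this anchor
        p = pos[torch.randint(pos.numel(), (1,))]        # one random positive
        neg = diff[torch.randperm(diff.numel())[:n_neg]]
        logits = (emb[i] @ emb[torch.cat([p, neg])].t() / tau).unsqueeze(0)
        target = torch.zeros(1, dtype=torch.long, device=emb.device)
        losses.append(F.cross_entropy(logits, target))   # positive at index 0
    return torch.stack(losses).mean()
```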
Comments, Suggestions And Typos:
  • Given that sentence length might affect the results in Table 1, additional statistics of the pre-training sentence lengths might be good to provide.
  • Could the authors provide a more detailed description of how the pre-training phrases are constructed, how many pre-training phrases are present, and how these phrases overlap with the downstream tasks?
Overall Assessment: 3.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 4 = Useful: I would recommend the new software to other researchers or developers for their ongoing work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper620 by Reviewer Hsyr

ACL ARR 2021 November Paper620 Reviewer Hsyr
27 Dec 2021 · ACL ARR 2021 November Paper620 Official Review · Readers: Program Chairs, Paper620 Senior Area Chairs, Paper620 Area Chairs, Paper620 Reviewers, Paper620 Authors
Paper Summary:

Problem as proposed by the paper: there is a need for good-quality phrase representations for topic mining problems, and existing techniques simply combine unigrams into n-grams or rely on extensive annotations. Solution proposed: the paper proposes a contrastive learning based approach to learn phrase representations, with techniques for getting positive and negative samples without annotated data, plus a clustering-based approach for improving the negative samples. All of these put together is called cluster-assisted contrastive learning. The paper shows improvement over existing methods for entity clustering and topical phrase mining.

Summary Of Strengths:
  1. The problem of learning phrase representations is very relevant to the entire field of NLP, not just topic mining.
  2. The solution proposed is novel and has the potential to be expanded beyond the scope of topic mining proposed in the paper.
  3. Experiments are extensive (for the problems under consideration) and results show improvement over existing methods.
Summary Of Weaknesses:
  1. The assumptions made for picking positive instances (that contexts of the same mention will be the same) could have been explored more thoroughly.
  2. The paper in general, and the experiments in particular, limit themselves to a few specific topic mining problems (entity clustering and topic clustering), leaving the general applicability of the technique unclear.
Comments, Suggestions And Typos:

The description of what counts as fine-tuning versus pretraining is confusing. Why is UCTopic w/o CCL pretraining while CCL is fine-tuning? This may be unimportant in itself, but it tends to confuse any further fine-tuning needed for entity clustering.

Overall Assessment: 4 = Strong: This paper is of significant interest (for broad or narrow sub-communities), and warrants acceptance in a top-tier *ACL venue if space allows.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 2 = Documentary: The new datasets will be useful to study or replicate the reported research, although for other purposes they may have limited interest or limited usability. (Still a positive rating)
Software: 2 = Documentary: The new software will be useful to study or replicate the reported research, although for other purposes it may have limited interest or limited usability. (Still a positive rating)
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper620 by Reviewer jJXS

ACL ARR 2021 November Paper620 Reviewer jJXS
25 Dec 2021 (modified: 25 Dec 2021) · ACL ARR 2021 November Paper620 Official Review · Readers: Program Chairs, Paper620 Senior Area Chairs, Paper620 Area Chairs, Paper620 Reviewers, Paper620 Authors
Paper Summary:

The paper proposes a method to pre-train a phrase encoder using the intuition that phrases of the same entity should have similar encodings and phrases of different entities should have different encodings. The paper evaluates the encoder on entity clustering and topical phrase mining, showing superior results compared to other methods. Finally, the paper shows a way to identify topics without supervision by clustering phrase encodings in the corpus, and fine-tunes the encoder to push encodings of phrases from different clusters further apart.

Summary Of Strengths:

The paper includes extensive experiments on the downstream tasks where the proposed model significantly outperforms other methods. Moreover, the method to identify topics by clustering phrase encodings seems to work pretty well, especially if the encoder is further fine-tuned on the task.

Summary Of Weaknesses:

I'm a little concerned about the novelty of the pre-training method for the phrase encoder.

In particular, FitzGerald et al. (2021) used the same intuition to train phrase encodings for the entity linking task. In another paper, Baldini Soares et al. (2019) trained a relation extraction model with the intuition that sentences containing the same pair of entities express a similar relation between those entities. There might be some differences between the methods, but I wonder how significant they are. Therefore, it is essential to discuss the similarities and differences between these prior works and the paper.

Nicholas FitzGerald, Daniel M. Bikel, Jan A. Botha, Daniel Gillick, Tom Kwiatkowski, and Andrew McCallum. MOLEMAN: mention-only linking of entities with a mention annotation network. ACL-IJCNLP 2021. https://aclanthology.org/2021.acl-short.37.pdf

Baldini Soares, L., FitzGerald, N., Ling, J., and Kwiatkowski, T. Matching the blanks: Distributional similarity for relation learning. ACL 2019. https://aclanthology.org/P19-1279.pdf

Comments, Suggestions And Typos:

I'm a bit confused about what in-batch negatives are for the topic modeling task. Do you mean phrases from different documents?

Overall Assessment: 3.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.