Anonymous
16 Nov 2021 (modified: 14 Jan 2022) · ACL ARR 2021 November Blind Submission · Readers: November, Senior Area Chairs, Area Chairs, Reviewers, Paper1309 Authors
Abstract: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for training. Specifically, we first present Iterative Contrastive Learning (ICoL) that iteratively trains the query and document encoders with a cache mechanism. ICoL not only enlarges the number of negative instances but also keeps representations of cached examples in the same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a simple yet effective way to enhance dense retrieval with lexical matching. We evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18 datasets of 9 zero-shot text retrieval tasks. Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models, and further analysis reveals the effectiveness of our training strategy and objectives. Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
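For readers unfamiliar with cache-based contrastive training, the sketch below illustrates one plausible way a cache of negative representations can be appended to an InfoNCE loss, as the abstract describes for ICoL. This is a hedged illustration rather than the authors' implementation: all names (`NegativeCache`, `cached_info_nce`, `tau`, `cache_size`) are assumptions, and the iterative alternation between the query and document encoders is omitted.

```python
# Hedged sketch, not the authors' code: a FIFO cache of detached passage
# embeddings reused as additional negatives in an InfoNCE contrastive loss.
import torch
import torch.nn.functional as F


class NegativeCache:
    """FIFO cache of previously encoded passage representations."""

    def __init__(self, dim: int, cache_size: int = 100_000):
        self.reps = torch.zeros(0, dim)
        self.cache_size = cache_size

    def enqueue(self, new_reps: torch.Tensor):
        # Detach so cached vectors carry no gradient; keep only the newest
        # `cache_size` vectors.
        new_reps = new_reps.detach().to(self.reps.device)
        self.reps = torch.cat([self.reps, new_reps], dim=0)[-self.cache_size:]


def cached_info_nce(q, d_pos, cache: NegativeCache, tau: float = 0.05):
    """InfoNCE over in-batch negatives plus cached negatives.

    q, d_pos: (B, dim) query / positive-passage embeddings from the two towers.
    """
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    candidates = torch.cat([d_pos, cache.reps.to(q.device)], dim=0)  # (B + C, dim)
    logits = q @ candidates.t() / tau                                # (B, B + C)
    labels = torch.arange(q.size(0), device=q.device)                # positives on the diagonal
    loss = F.cross_entropy(logits, labels)
    cache.enqueue(d_pos)
    return loss
```

Because the cached vectors are detached and only consumed as negatives, such a cache can grow far beyond what a single batch on one GPU could hold, which is the memory argument the abstract makes.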
Revealed to Canwen Xu, Daya Guo, Nan Duan, Julian McAuley

15 Nov 2021 (modified: 15 Nov 2021) · ACL ARR 2021 November Submission
TL;DR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.
Preprint: no
Preferred Venue: ACL 2022
Consent: yes
Consent To Review: yes

6 Replies

Supplementary Materials by Program Chairs

ACL ARR 2021 November Program Chairs
14 Jan 2022 · ACL ARR 2021 November Paper1309 Supplementary Materials · Readers: Program Chairs, Paper1309 Reviewers, Paper1309 Authors, Paper1309 Area Chairs
Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, none have been submitted.

Meta Review of Paper1309 by Area Chair CcQc

ACL ARR 2021 November Paper1309 Area Chair CcQc
03 Jan 2022 · ACL ARR 2021 November Paper1309 Meta Review · Readers: Paper1309 Senior Area Chairs, Paper1309 Area Chairs, Paper1309 Authors, Paper1309 Reviewers, Program Chairs
Metareview:

Interesting solution, but not fully dense.

Summary Of Reasons To Publish:
  • 3/4 reviewers acknowledge strong performance on the BEIR benchmark
  • 1/4 reviewer acknowledges the efficiency of the solution
  • 1/4 reviewer praises the open-source effort
Summary Of Suggested Revisions:
  • 3/4 reviewers report that the proposed solution is not purely dense but hybrid (with BM25), and it is not clear what contributes the most.
  • 2/4 reviewers suggest more ablations/baselines
  • 1/4 reviewer finds some claims confusing
  • 1/4 reviewer asks for clarity on the evaluation metrics
Overall Assessment: 3 = There are major points that may be revised

Official Review of Paper1309 by Reviewer YfgT

ACL ARR 2021 November Paper1309 Reviewer YfgT
27 Dec 2021 · ACL ARR 2021 November Paper1309 Official Review · Readers: Program Chairs, Paper1309 Senior Area Chairs, Paper1309 Area Chairs, Paper1309 Reviewers, Paper1309 Authors
Paper Summary:

This paper aims to train a fully unsupervised pretrained retriever for zero-shot text retrieval. Specifically, this paper proposes a novel iterative contrastive learning method to pretrain a dual-tower dense retriever, and then uses a lexical matching method to further enhance the pretrained dense retrieval model. Results on the BEIR benchmark show that the proposed method achieves SOTA performance compared with supervised dense retrieval models.

Summary Of Strengths:
  1. The proposed unsupervised dense retrieval model achieves remarkable performance compared with supervised dense retrieval models on the BEIR benchmark.
  2. The proposed model is initialized from a 6-layer DistilBERT and the parameters of the dual encoders are tied, so the proposed retrieval model is quite efficient.
Summary Of Weaknesses:
  1. The selling point of this paper is that an unsupervised pretrained dense retriever (LaPraDoR) can perform on par with supervised dense retrievers, but actually, LaPraDoR is a hybrid retriever rather than a pure dense retriever. In a way, it is unfair to compare a hybrid method to dense/sparse methods as shown in Table 1, because it is known that dense and sparse retrievers are complementary. The comparable supervised models should also be hybrid retrievers. Besides, in Table 3, it seems that without lexicon enhancement, the performance of the proposed unsupervised model is not competitive compared with supervised models, either on in-domain MS MARCO or on the cross-domain BEIR benchmark.
  2. In Table 4, the combination of the self-supervised tasks ICT and DaPI does not seem to be complementary; the effectiveness of the DaPI task, which will double the GPU memory usage (a sketch of the dropout-positive setup appears after this list), is not significant (0.434 -> 0.438).
  3. ICoL is proposed to mitigate insufficient memory on a single GPU and allow more negative instances for better performance, but there are no corresponding experiments showing the influence of the number of negatives. As far as I know, the quality of negatives is more important than the quantity of negatives, as shown in TAS-B.
  4. It sounds unreasonable that increasing the model size can hurt performance, as a recent paper (Ni et al.) shows that scaling laws also apply to dense retrieval models, so the preliminary experimental results on Wikipedia regarding model size should be provided in detail.
  5. The paper argues that the proposed approach is to complement lexical matching with semantic matching, while the training procedure of the proposed model is totally independent of lexical matching. Therefore, the argument "LEDR helps filter out such noise and allows the dense retriever to focus on fine-grained semantic matching" is confusing, because there is no succession relationship between LEDR and the dense retriever.
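To make point 2 above concrete, here is a minimal, hedged sketch of a SimCSE/DaPI-style dropout positive (not the authors' code): the same batch is encoded twice with dropout active, which is why activation memory roughly doubles. The encoder is assumed to be a HuggingFace-style model exposing `last_hidden_state`, and `tau` is an illustrative temperature.

```python
# Hedged sketch of a dropout-based positive (SimCSE / DaPI style).
# Two forward passes over the same batch with dropout on give two slightly
# different [CLS] views per sentence, used as a positive pair; other in-batch
# sentences act as negatives. The double forward pass is why memory grows.
import torch
import torch.nn.functional as F


def dropout_positive_loss(encoder, input_ids, attention_mask, tau: float = 0.05):
    encoder.train()  # keep dropout active so the two passes differ
    z1 = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    z2 = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B); diagonal entries are positives
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```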

Reference: Ni et al., 2021. https://arxiv.org/abs/2112.07899

Comments, Suggestions And Typos:

The proposed LaPraDoR achieves relatively low performance on MS MARCO while achieving relatively high performance on BEIR; the inductive bias of the proposed pre-training method is worth exploring. Lines 300-304: q and d are confusing.

Overall Assessment: 2.5
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Best Paper Justification:

N/A

Replicability: 2 = They would be hard pressed to reproduce the results: The contribution depends on data that are simply not available outside the author's institution or consortium and/or not enough details are provided.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Ethical Concerns:

N/A


Official Review of Paper1309 by Reviewer E1hH

ACL ARR 2021 November Paper1309 Reviewer E1hH
22 Dec 2021 · ACL ARR 2021 November Paper1309 Official Review · Readers: Program Chairs, Paper1309 Senior Area Chairs, Paper1309 Area Chairs, Paper1309 Reviewers, Paper1309 Authors
Paper Summary:

A method to train dense retrieval without the need for any supervised data, using a combination of ICT and SimCSE. Further, they propose Iterative Contrastive Learning (ICoL) as a new way of caching negative embeddings.

Summary Of Strengths:
  • The paper is well written and easy to understand.
  • The paper conducts many experiments and presents many interesting insights
  • The results are convincing on the BEIR benchmark
  • ICoL appears to be an interesting approach showing improvements over MoCo
  • Overall a strong paper but with some weaknesses in evaluating the individual components / individual changes.
Summary Of Weaknesses:

What is contributing to the performance improvement?

  • Because the paper presents so many novelties, it is a bit hard to grasp what led to the improvement on BEIR, i.e. what are the main factors that contribute to the improvement?
  • It appears (Table 3) that the strongest improvements come from Lexicon-Enhanced Dense Retrieval, i.e. combining the dense similarity score with a BM25 score. This hybrid approach has been shown effective in several previous works, e.g. https://arxiv.org/abs/2005.00181 or https://arxiv.org/pdf/2004.13969.pdf (and many more); a minimal sketch of such a hybrid combination follows this list.
  • It would be interesting to get the results for other dense + BM25 combinations. The dense retriever appears not to be the strongest (cf. Table 3, col. w/o LEDR), i.e. it is weaker than TAS-B (0.396 vs 0.415). So what would be the results of TAS-B+LEDR?
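As a point of reference for the hybrid combination questioned above, below is a minimal, hedged sketch of one common way to combine a BM25 score with a dense similarity. Whether LaPraDoR multiplies, adds, or interpolates the two signals should be checked against the paper; the interpolation weight `alpha` and the function names are illustrative assumptions, not the authors' API.

```python
# Hedged sketch of a lexicon-enhanced (hybrid) ranking score; the paper's exact
# combination of BM25 and dense similarity may differ from this interpolation.
def hybrid_score(bm25_score: float, dense_score: float, alpha: float = 0.5) -> float:
    """Linear interpolation of a lexical score and a dense similarity score."""
    return alpha * dense_score + (1.0 - alpha) * bm25_score


def rank_candidates(candidates: dict, alpha: float = 0.5):
    """candidates maps doc_id -> (bm25_score, dense_score); returns doc ids best-first."""
    return sorted(candidates,
                  key=lambda doc_id: hybrid_score(*candidates[doc_id], alpha),
                  reverse=True)
```

Running such a combination over a shared candidate pool is also what would make the suggested TAS-B+LEDR comparison straightforward to produce.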

Pre-training approach

  • A large contribution of the authors is the proposal of a new pre-training approach that combines ICT + SimCSE with a large negative cache (ICoL)
  • Table 3 shows an improvement for the sole dense retrieval model if pre-training is used (e.g. w/o LEDR vs. w/o LEDR & PT)
  • However, we know that BERT is undertrained (see the RoBERTa paper) and performing more pre-training steps yields an improved BERT model
  • A comparison against other pre-training approaches would have been interesting. What happens when I train e.g. with MLM for the same amount of compute on C4 and then do fine-tuning? What about other pre-training methods for dense retrievers (for an overview see https://arxiv.org/abs/2112.07577)? Is the proposed pre-training method actually better than other pre-training strategies?

I think the 3 additions (pre-training, ICoL, hybrid retrieval with BM25) could be better separated / evaluated more individually. It would help to see what the contributing factors for the improved performance are. So far it appears that switching to a hybrid approach with BM25 made the biggest difference in performance.

Having a clearer separation in their evaluation would help to assess what is really relevant for future methods.

Comments, Suggestions And Typos:
  • Line 231: DaPI was concurrently also proposed by Liu et al. (who call it Mirror-BERT, https://arxiv.org/abs/2104.08027, published April 16; SimCSE was published April 18)
  • Line 430: You say that the batch size for each GPU is 4,096, so 16*4k = 64k in total for your server. But how do you get a batch of 4,096 on a single 32GB GPU? A batch larger than 100 examples usually results in an out-of-memory error when using huggingface transformers with a distilbert model. Did you use some special off-loading?
  • Maybe I missed it in the paper: did you specify your max sequence length for pre-training / fine-tuning? I.e. do you use 512 word pieces or do you pre-train on shorter paragraphs?
  • Table 4: The heading "w/o ICT (SimCSE 2021c)" looks confusing; you would think that ICT was proposed by SimCSE (2021c)
  • Table 4: Adding SimCSE to pre-training appears to have little effect; performance improves from 43.4 -> 43.8. Could this improvement just be due to randomness? How does the performance change in the fine-tuning setting when pre-training was done without SimCSE (i.e. just ICT pre-training followed by supervised MS MARCO training)?
  • Table 5: The three settings: I assume these are the hybrid approaches with BM25. What is the performance for the dense retrievers alone, without BM25?
Overall Assessment: 4 = Strong: This paper is of significant interest (for broad or narrow sub-communities), and warrants acceptance in a top-tier *ACL venue if space allows.
Confidence: 5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
Best Paper: No
Replicability: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper1309 by Reviewer K746

ACL ARR 2021 November Paper1309 Reviewer K746
20 Dec 2021 (modified: 27 Dec 2021) · ACL ARR 2021 November Paper1309 Official Review · Readers: Program Chairs, Paper1309 Senior Area Chairs, Paper1309 Area Chairs, Paper1309 Reviewers, Paper1309 Authors
Paper Summary:

This paper proposes an unsupervised pre-training method for dense retrievers. The authors propose ICoL to conduct effective pre-training, borrowing ideas from contrastive learning and ICT. The experimental results show its effectiveness on the BEIR benchmark, which includes 18 datasets. There are some weaknesses in this article, mainly the lack of novelty and the lack of connection between the two main strategies proposed, which weakens the overall contribution.

Overall, I think this paper is not ready for an *ACL venue yet and might still be a good pick for workshops.

Summary Of Strengths:
  1. The topic follows a current trend in the dense retrieval area and has value to explore.
  2. Good results on zero-shot evaluation with the BEIR benchmark.
  3. The promise of open-sourcing the code gives assurance of reproducibility.
Summary Of Weaknesses:
  1. The main contribution of the proposed approach lacks novelty. Since BM25 has strong zero-shot capability, it seems that the combination with BM25 serves mainly to enhance the zero-shot capability of the model. The proposed ICoL is mainly an integration of some existing methods.
  2. The authors do not illustrate why it is better to keep representations in the same hidden space; nor has it been experimentally verified that they are in the same hidden space.
  3. The link between ICoL and the BM25 weighting is not fully worked out.
Comments, Suggestions And Typos:
  1. In lines 300-301, "$\{d^-_{i,1}, \dots, d^-_{i,n}\}$ is a set of randomly sampled query negatives": query -> document?
Overall Assessment: 2.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper1309 by Reviewer mBe6

ACL ARR 2021 November Paper1309 Reviewer mBe6
16 Dec 2021 · ACL ARR 2021 November Paper1309 Official Review · Readers: Program Chairs, Paper1309 Senior Area Chairs, Paper1309 Area Chairs, Paper1309 Reviewers, Paper1309 Authors
Paper Summary:

In this paper the authors propose a pre-trained retriever that does not require any supervised data for training, called LaPraDoR. Specifically, they present 3 contributions in their paper:

  1. LaPraDoR: an all-around unsupervised pretrained dense retriever that achieves state-of-the-art performance on the BEIR benchmark.
  2. ICoL (Iterative Contrastive Learning), which iteratively trains the query and document encoders with a cache mechanism to mitigate insufficient memory on a single GPU and allow more negative instances for better performance.
  3. LEDR (Lexicon-Enhanced Dense Retrieval), which combines BM25 with a dense retriever to consider both lexical and semantic matching. The authors evaluate these contributions on the recently proposed BEIR benchmark and show the effectiveness of their system.
Summary Of Strengths:

The paper is well written and clear; therefore, the contributions presented are easy to understand. The authors propose 3 strong contributions that are of interest to the information retrieval community.

Summary Of Weaknesses:

The authors do not give any explanation or definition of the evaluation metric used. The results are presented very briefly, without discussion of the differences observed across the 18 English zero-shot evaluation datasets (even though this is one of the interesting aspects of the BEIR benchmark).

Comments, Suggestions And Typos:

The authors should spend some time analyzing the results obtained for the different datasets. For example, why are the results for LaPraDoR (unsupervised and FT) lower than Late Interaction for some datasets (FEVER or CQADupStack) while they are better for others? Why are the results almost the same for Re-ranking, LaPraDoR unsupervised, and FT, while they are really different for FEVER, for example? On the contrary, I found it really difficult to draw any conclusion from a case study with two examples. In the "Carbon footprint" section, the authors mention "All emitted carbon dioxide has already been offset". What does that mean exactly?

Overall Assessment: 4 = Strong: This paper is of significant interest (for broad or narrow sub-communities), and warrants acceptance in a top-tier *ACL venue if space allows.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Replicability: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 4 = Useful: I would recommend the new software to other researchers or developers for their ongoing work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.