ARR November Blind Submission • LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval
Meta Review of Paper1309 by Area Chair CcQc
Interesting solution, but not fully dense.
- 3/4 reviewers acknowledge strong performance on BEIR benchmark
- 1/4 reviewer acknowledges the efficiency of the solution
- 1/4 reviewer praises the open-source effort
- 3/4 reviewers report that the proposed solution is not purely dense but hybrid (with BM25) and it is not clear what contributes the most.
- 2/4 reviewers suggest more ablations/baselines
- 1/4 reviewer finds some claims confusing
- 1/4 reviewer asks for clarity on the evaluation metrics
Official Review of Paper1309 by Reviewer YfgT
This paper aims to train a fully unsupervised pretrained retriever for zero-shot text retrieval. Specifically, the paper proposes a novel iterative contrastive learning scheme to pretrain a dual-tower dense retriever, and then uses a lexical matching method to further enhance the pretrained dense retrieval model. Results on the BEIR benchmark show that the proposed method achieves SOTA performance compared with supervised dense retrieval models.
- The proposed unsupervised dense retrieval model achieves remarkable performance compared with supervised dense retrieval models on the BEIR benchmark.
- The proposed model is initialized from a 6-layer DistilBERT and the parameters of the dual encoders are tied, so the proposed retrieval model is quite efficient.
- The selling point of this paper is that an unsupervised pretrained dense retriever (LaPraDoR) can perform on par with supervised dense retrievers, but LaPraDoR is actually a hybrid retriever rather than a pure dense retriever. In a way, it is unfair to compare a hybrid method to dense/sparse methods as in Table 1, because it is known that dense and sparse retrievers are complementary; the comparable supervised models should also be hybrid retrievers. Besides, Table 3 suggests that without lexicon enhancement, the proposed unsupervised model is not competitive with supervised models, either in-domain on MS-MARCO or cross-domain on the BEIR benchmark.
- In Table 4, the self-supervised tasks ICT and DaPI do not seem complementary: the gain from the DaPI task, which doubles GPU memory usage, is marginal (0.434 -> 0.438). (See the sketch below this list for why the dropout-based positives require a second forward pass, and hence the extra memory.)
- ICoL is proposed to mitigate the insufficient memory on a single GPU and to allow more negative instances for better performance, but there are no corresponding experiments showing the influence of the number of negatives. As far as I know, the quality of negatives matters more than their quantity, as shown in TAS-B.
- It seems counterintuitive that increasing the model size can hurt performance, as a recent paper (Ni et al., 2021) shows that scaling laws also apply to dense retrieval models; the preliminary experimental results on Wikipedia regarding model size should therefore be reported in detail.
- The paper argues that the proposed approach complements lexical matching with semantic matching, yet the training procedure of the proposed model is completely independent of lexical matching. Therefore, the claim that "LEDR helps filter out such noise and allows the dense retriever to focus on fine-grained semantic matching" is confusing, because there is no succession relationship between LEDR and the dense retriever.
Reference: Ni et al., 2021. https://arxiv.org/abs/2112.07899
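To make the memory concern above concrete, here is a minimal sketch of a SimCSE/DaPI-style objective (our own illustration, not the authors' code), where each batch is encoded twice under different dropout masks, so the activations of two forward passes must be kept in memory:

```python
# Minimal sketch (assumptions: any HF encoder with dropout; not the paper's exact code).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.train()  # keep dropout active so the two passes differ

texts = ["a batch of unlabeled sentences", "another unlabeled sentence"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

# Two forward passes over the same batch -> two dropout-perturbed views.
# This is why GPU memory roughly doubles compared to a single pass.
z1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, view 1
z2 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, view 2

# InfoNCE with in-batch negatives: z1[i] should match z2[i].
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / 0.05
loss = F.cross_entropy(sim, torch.arange(len(texts)))
loss.backward()
```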
The proposed LaPraDoR achieves relatively low performance on MS-MARCO but relatively high performance on BEIR; the inductive bias of the proposed pretraining method is worth exploring. Lines 300-304: q and d are confusing.
Official Review of Paper1309 by Reviewer E1hH
A method to train a dense retriever without the need for any supervised data, using a combination of ICT and SimCSE. Further, the authors propose Iterative Contrastive Learning (ICoL) as a new way of caching negative embeddings.
- The paper is well written and easy to understand.
- The paper conducts many experiments and presents many interesting insights
- The results are convincing on the BEIR benchmark
- ICoL appears to be an interesting approach showing improvements over MoCo
- Overall a strong paper but with some weaknesses in evaluating the individual components / individual changes.
What is contributing to the performance improvement?
- Because the paper presents so many novelties, it is a bit hard to grasp what led to the improvement on BEIR, i.e. what are the main factors that contribute to the improvement?
- It appears (Table 3) that the strongest improvement comes from Lexicon-Enhanced Dense Retrieval, i.e. combining the dense similarity score with a BM25 score. This hybrid approach has been shown effective in several previous works, e.g. https://arxiv.org/abs/2005.00181 or https://arxiv.org/pdf/2004.13969.pdf (and many more). A sketch of the kind of score fusion we mean is given after this list.
- It would be interesting to get the results for other dense + BM25 combinations. The dense retriever itself appears not to be the strongest (cf. Table 3, col. w/o LEDR), i.e. it is weaker than TAS-B (0.396 vs. 0.415). So what would be the results of TAS-B + LEDR?
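To be explicit about what we mean by a dense + BM25 combination, here is a minimal sketch of a simple score fusion (our own illustration; the paper's LEDR may combine the scores differently, e.g. multiplicatively), which could equally be applied on top of TAS-B:

```python
# Minimal sketch (assumption: a simple linear interpolation; LEDR's exact
# combination in the paper may differ, e.g. it could multiply the scores).
from typing import Dict

def hybrid_scores(bm25: Dict[str, float],
                  dense: Dict[str, float],
                  alpha: float = 0.5) -> Dict[str, float]:
    """Combine per-document BM25 and dense (e.g. cosine) scores."""
    # Min-max normalize each score list so the two scales are comparable.
    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    bm25_n, dense_n = normalize(bm25), normalize(dense)
    docs = set(bm25_n) | set(dense_n)
    # Missing docs get score 0 from the retriever that did not return them.
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
            for d in docs}

# Example: rerank the union of the two candidate lists.
fused = hybrid_scores({"d1": 12.3, "d2": 8.1}, {"d1": 0.62, "d3": 0.71})
print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
```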
Pre-training approach
- A large contribution of the authors is the proposal of a new pre-training approach that combines ICT + SimCSE with a large negative cache (ICoL)
- Table 3 shows an improvement for the dense retrieval model alone if pre-training is used (e.g. w/o LEDR vs. w/o LEDR & PT)
- However, we know that BERT is undertrained (see the RoBERTa paper), and performing more pre-training steps yields an improved BERT model
- A comparison against other pre-training approaches would have been interesting. What happens when I train e.g. with MLM for the same amount of compute on C4 and then do fine-tuning? What about other pre-training methods for dense retrievers (for an overview see https://arxiv.org/abs/2112.07577)? Is the proposed pre-training method actually better than other pre-training strategies?
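As one concrete instance of the comparison we are asking for, here is a rough sketch of a continued-MLM baseline under a matched compute budget (a hypothetical setup; the dataset/model choices and hyperparameters are placeholders, not taken from the paper):

```python
# Rough sketch of a continued-MLM pre-training baseline (hypothetical setup,
# not from the paper): keep training DistilBERT with MLM on C4 for the same
# compute, then fine-tune / evaluate the resulting checkpoint as a retriever.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

raw = load_dataset("c4", "en", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw.map(tokenize, batched=True,
                    remove_columns=["text", "timestamp", "url"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="distilbert-c4-mlm",
                         per_device_train_batch_size=64,
                         max_steps=100_000,  # placeholder: match the compute of the proposed pre-training
                         learning_rate=5e-5)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized).train()
# The resulting checkpoint would then be evaluated in exactly the same way as
# the LaPraDoR encoder, isolating the effect of the pre-training objective.
```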
I think the 3 additions (pre-training, ICoL, hybrid retrieval with BM25) could be better separated / evaluated more individually. It would help to see what the contributing factors are for the improved performance. So far it appears that switching to a hybrid approach with BM25 made the largest difference in performance.
Having a clearer separation in the evaluation would help to assess what is really relevant for future methods.
- Line 231: DaPI was also concurrently proposed by Liu et al. (who call it Mirror-BERT, https://arxiv.org/abs/2104.08027, published April 16; SimCSE was published April 18)
- Line 430: You say that the batch size for each GPU is 4,096, so 16 * 4k = 64k in total for your server. But how do you fit a batch of 4,096 onto a single 32GB GPU? A batch of more than roughly 100 examples usually results in an out-of-memory error when using Hugging Face Transformers with a DistilBERT model. Did you use some special off-loading?
- Maybe I missed it in the paper: did you specify your maximum sequence length for pre-training / fine-tuning? I.e. do you use 512 word pieces, or do you pre-train on shorter paragraphs?
- Table 4: The heading "w/o ICT (SimCSE 2021c)" is confusing; one would think that ICT was proposed by SimCSE (2021c).
- Table 4: Adding SimCSE to pre-training appears to bring little effect; performance improves from 43.4 -> 43.8. Could this improvement just be due to randomness (a sketch of a paired bootstrap check is given after this list)? How does the performance change in the fine-tuning setting when pre-training was done without SimCSE (i.e. just ICT pre-training followed by supervised MS MARCO training)?
- Table 5: The three settings, I assume, are the hybrid approaches with BM25. What is the performance of the dense retrievers alone, without BM25?
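Regarding the 43.4 vs. 43.8 question above, a paired bootstrap over per-query scores would be one way to check whether the gap exceeds run-to-run noise. A sketch (the per-query arrays are placeholders; the paper does not report per-query scores):

```python
# Sketch of a paired bootstrap test over per-query nDCG@10 scores
# (the arrays here are hypothetical placeholders, not from the paper).
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which system B does NOT beat system A."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    diffs = (b[idx] - a[idx]).mean(axis=1)
    return float((diffs <= 0).mean())  # roughly a one-sided p-value

# Usage with hypothetical per-query results of the two pre-training variants:
# p = paired_bootstrap(ndcg_without_dapi, ndcg_with_dapi)
```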
Official Review of Paper1309 by Reviewer K746
This paper proposes an unsupervised pre-training method for dense retrievers. The authors propose ICoL to conduct effective pre-training, borrowing ideas from contrastive learning and ICT. The experimental results show its effectiveness on the BEIR benchmark, which includes 18 datasets. There are some weaknesses in this article, mainly the lack of novelty and the lack of connection between the two main strategies proposed, which weakens the overall contribution.
Overall, I think this paper is not ready for an ACL* venue yet and might still be a good pick for workshops.
- The topic is a current trend in the dense retrieval area and is worth exploring.
- Good results on zero-shot evaluation with BEIR benchmark.
- The promise to open-source the code gives assurance of reproducibility.
- The main contribution of the proposed approach lacks novelty. Since BM25 already has strong zero-shot capability, the combination with BM25 seems to serve mainly to enhance the zero-shot capability of the model. The proposed ICoL is mainly an integration of existing methods.
- The authors do not explain why it is better to keep representations in the same hidden space, nor is it experimentally verified that they are in the same hidden space.
- The link between ICoL and the BM25 weighting is not fully worked out.
- In lines 300-301, "{d_{i,1}^-, ..., d_{i,n}^-} is a set of randomly sampled query negatives": should "query" be "document"?
Official Review of Paper1309 by Reviewer mBe6
In this paper, the authors propose a pre-trained retriever that does not require any supervised data for training, called LaPraDoR. Specifically, they present 3 contributions in their paper:
- LaPraDoR: an all-around unsupervised pretrained dense retriever that achieves state-of-the-art performance on the BEIR benchmark.
- ICoL (Iterative Contrastive Learning), which iteratively trains the query and document encoders with a cache mechanism to mitigate the insufficient memory on a single GPU and allow more negative instances for better performance (a rough sketch of such a cache appears after this list).
- LEDR (Lexicon-Enhanced Dense Retrieval), which combines BM25 with a dense retriever to consider both lexical and semantic matching. The authors evaluate these contributions on the recently proposed BEIR benchmark and show the effectiveness of their system.
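To illustrate what such a cache mechanism looks like, here is a rough, MoCo-style sketch of a queue of cached negative embeddings (simplified for exposition; ICoL's actual iterative scheme, which alternates the trained encoder and keeps the cache in the same embedding space, has more moving parts):

```python
# Rough sketch of a cached-negative queue (simplified; not the authors' code).
import torch
import torch.nn.functional as F

class NegativeCache:
    def __init__(self, dim: int, capacity: int = 100_000):
        self.queue = torch.empty(0, dim)
        self.capacity = capacity

    def enqueue(self, embeddings: torch.Tensor) -> None:
        # Cached embeddings are detached: they serve only as extra negatives.
        self.queue = torch.cat([self.queue, embeddings.detach()])[-self.capacity:]

def contrastive_loss(q, d_pos, cache: NegativeCache, temperature: float = 0.05):
    """InfoNCE with in-batch negatives plus cached negatives."""
    q, d_pos = F.normalize(q, dim=-1), F.normalize(d_pos, dim=-1)
    if len(cache.queue):
        candidates = torch.cat([d_pos, F.normalize(cache.queue, dim=-1)])
    else:
        candidates = d_pos
    logits = q @ candidates.T / temperature   # (batch, batch + cache_size)
    labels = torch.arange(q.size(0))          # positives are the leading diagonal block
    loss = F.cross_entropy(logits, labels)
    cache.enqueue(d_pos)                      # grow the cache for the next step
    return loss
```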
The paper is well written and clear; therefore, the contributions presented are easy to understand. The authors propose 3 strong contributions that are of interest to the information retrieval community.
The authors do not give any explanation or definition of the evaluation metric used. The results are presented very briefly, without a discussion of the differences observed across the 18 English zero-shot evaluation datasets (even though this is one of the main interests of the BEIR benchmark).
The authors should spend some time analyzing the results obtained on the different datasets. For example, why are the results for LaPraDoR (unsupervised and FT) lower than Late Interaction for some datasets (FEVER or CQADupStack) while they are better for others? Why are the results almost the same overall for Re-ranking, LaPraDoR unsupervised and FT, while they are really different on FEVER, for example? In addition, I found it really difficult to draw any conclusion from a case study with only two examples. In the "Carbon footprint" section, the authors mention "All emitted carbon dioxide has already been offset". What does that mean exactly?
Supplementary Materials by Program Chairs