Comments to author (Associate Editor)
=====================================

Most of the reviewers agree that the paper is interesting, well written, and that the results are promising. The main aspects to address in a final version include:
(1) Improving the motivation for using a VLM and clarifying its actual usefulness (Reviewers 2 and 3)
(2) Discussing the benefits of the CLIP features within the approach (Reviewers 1 and 2)
(3) Discussing limitations and failure cases (Reviewers 1 and 3)

----------------------------------------

Comments on Video Attachment:

There was agreement that the video is good, but one reviewer pointed out that adding more case studies would make it better.


Reviewer 1 of IROS 2025 submission 1033

Comments to the author
======================

*Summary

This paper introduces an interesting approach to VLM-guided, language-based navigation, specifically targeting the challenge of viewpoint variation in pre-trained models like CLIP. The authors propose a weakly supervised contrastive learning framework that refines CLIP features to enhance performance on vision-and-language navigation tasks. They evaluate their method on R2R, REVERIE, and SOON, comparing it against baselines without LLMs/VLMs, with fine-tuned LLMs/VLMs, and with pre-trained LLMs/VLMs.

*Positive Aspects

The paper is clearly written and well-structured, presenting an interesting method with thorough comparisons across three datasets. The results are compelling, demonstrating consistent improvements over approaches using pre-trained LLMs/VLMs and achieving performance on par with or better than methods that do not rely on language models.

*Areas of Improvement

Below are some questions and minor areas of improvement:

Sample Selection Process:
- Are there any failure cases with the positive sample selection? For example, the object lists are not very descriptive, being constrained to one-word labels for each object in an observation. With m set to one, how many false-positive pairs do you get?
- What happens if you make the prompt output more descriptive, for example color + label, such as "black table" or "red cushion"? Does that improve the accuracy of the assigned positive pairs?

The Instruction Understanding Module paragraph could do with a few more technical details.

Ablations:
- What happens if you increase m?
- What is the performance improvement of the refined embeddings from the contrastive learning framework over the original pre-trained CLIP embeddings?

Experiment Tables:
- The tables could do with an extra column indicating, for clarity, whether each method uses VLMs/LLMs and whether it fine-tunes them.

Comments on the Video Attachment
================================

The video is clear and well-presented.


Reviewer 2 of IROS 2025 submission 1033

Comments to the author
======================

The main contribution of this paper is to introduce a model that helps the agent recognize the same objects from different viewpoints. This model is trained using VLM-guided weakly supervised learning. The experimental results show that this method surpasses existing methods on most metrics in the R2R, REVERIE, and SOON benchmarks.

However, I still find the motivation debatable. For example, is it really important for an agent to recognize the same object from different views in the VLN task? If so, why not use a VLM directly during inference? It would be clearer if the authors included experiments to demonstrate the necessity of this mechanism and explained why a VLM cannot be used directly.
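For concreteness, the kind of direct inference-time use of a VLM that I have in mind could be as simple as comparing CLIP image embeddings of two viewpoints; a rough sketch of such a baseline might look like the following (the checkpoint name, similarity threshold, and file names are illustrative assumptions, not taken from the paper):

    # Hypothetical sketch only -- not the authors' method. Assumes an
    # off-the-shelf CLIP checkpoint, an arbitrary similarity threshold, and
    # placeholder image files for two viewpoints of a candidate object.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_embedding(image_path: str) -> torch.Tensor:
        """L2-normalized CLIP image embedding of one observation."""
        image = Image.open(image_path)
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
        return features / features.norm(dim=-1, keepdim=True)

    def same_object(view_a: str, view_b: str, threshold: float = 0.8) -> bool:
        """Treat two viewpoints as showing the same object when their CLIP
        embeddings are sufficiently similar (threshold is a guess)."""
        similarity = (clip_embedding(view_a) * clip_embedding(view_b)).sum().item()
        return similarity >= threshold

    # e.g. same_object("viewpoint_front.jpg", "viewpoint_side.jpg")

Reporting how such a naive zero-shot baseline compares to the proposed refinement would make the necessity of the new mechanism much clearer.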
Also, there are often multiple objects of the same category in the observations (for instance, several chairs or books), but in the proposed method these are treated as the same object. This could increase the agent's uncertainty when interpreting the environment, especially under high-level instructions as in the REVERIE benchmark. Moreover, some previous works with better performance on REVERIE are missing, such as AIGeN [1].

The main difference between this paper and DUET [22] lies in the visual perception component. It is unclear whether the proposed method's superior performance stems from this new mechanism or merely from the visual features extracted by the CLIP model. The authors could also benefit from including more case studies to better demonstrate the superiority of their method and highlight differences from other approaches.

Given the insufficiently credible motivation and the lack of important experiments and analysis, this paper is on the low borderline of acceptance.

[1] Rawal, Niyati, et al. "AIGeN: An Adversarial Approach for Instruction Generation in VLN." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comments on the Video Attachment
================================

This video lacks detailed explanations of the method and analysis of the results. It would be better if the video included at least one slide highlighting the visual perception component, and only one sample for the case study is not enough.


Reviewer 3 of IROS 2025 submission 1033

Comments to the author
======================

This paper investigates the vision-and-language navigation (VLN) task and introduces a weakly-supervised partial contrastive learning (WPCL) method to enhance the agent's ability to recognize objects from dynamic viewpoints. The proposed approach is evaluated on three datasets and achieves competitive performance. While the paper has merits, several issues should be addressed:

1. Limited Consideration of Environmental Information: The paper primarily focuses on object recognition in VLN environments. However, objects represent only a portion of the environmental context. Other crucial factors, such as room types and action semantics, are not sufficiently considered. A broader discussion of these aspects would strengthen the work.

2. Limited Necessity of Vision-Language Models (VLMs): The role of VLMs in this paper appears marginal. The authors utilize VLMs primarily for object detection to assist task-specific model training. This contrasts with other VLM-based approaches, where VLMs serve as the primary backbone. Consequently, the comparisons in Table I may not be entirely fair, particularly when juxtaposing the proposed method with LLM-based approaches.

3. Ambiguity in Causal Structure Representation: The causal structure presented in Figure 3 is unclear and does not fully align with standard causal framework definitions. The authors should provide a more precise explanation or revise the figure to better adhere to causal inference principles.

4. Reproducibility and Code Release: To facilitate research reproducibility and contribute to the VLN community, I strongly recommend that the authors release their code and trained models.

By addressing these concerns, the paper can be significantly improved.

Comments on the Video Attachment
================================

The attached videos provide clear and well-organized presentations, effectively illustrating key aspects of the manuscript.
The content is informative and complements the paper's discussions.