Reviewer #1 Questions

2. Importance/Relevance: 4. Of broad interest

5. Originality/Novelty: 3. Moderately original; provides limited new insights or understanding

6. Justification of Originality/Novelty Score (required): Three major innovations:
(1) Dual-Feature Extraction: CoLLAP employs a dual-feature extractor for processing full-length music tracks, using BEATs for music encoding and Whisper for speech encoding, which allows the model to capture diverse audio characteristics within the waveform.
(2) Structured Text Encoding: The model extracts text representations from captions generated by Music-LLMs, enhanced with musical structural information, providing a comprehensive understanding of the audio content across both global and temporal dimensions.
(3) Multimodal Attention Mechanism: CoLLAP introduces a novel multimodal and temporal attention mechanism that computes both kernel-wise and temporal attention to strengthen the alignment between audio and text, improving retrieval accuracy for long-form music paired with detailed textual descriptions.

7. Theoretical Development: 4. Correct; provides important new insights or theoretical understanding

9. Experimental Validation: 3. Limited but convincing

11. Clarity of Presentation: 4. Very clear

13. Reference to Prior Work: 4. Excellent references

15. Overall evaluation of this paper: 4. Definite accept

16. Justification of Overall evaluation of this paper (required):
Multimodal Alignment and Contrastive Learning Approach: The innovative use of multimodal alignment and contrastive learning strategies effectively captures the relationships between different data types, such as text and audio. This approach strengthens the model's generalization capabilities and robustness in handling cross-modal data.
Robust Experimental Results: The paper presents strong experimental outcomes, which validate the effectiveness of the proposed models.
These results are a testament to the rigorous experimental design and execution, providing a solid foundation for the research and its conclusions.
Strong Writing Skills: The paper demonstrates excellent writing, which is crucial for clearly articulating complex ideas.
Disadvantages: With the introduction of even larger-scale pretrained models such as GPT-4 or LLaMA 3, the paper's approach could face scalability challenges.

20. Additional comments to author(s): (Required if no other justification comments have been provided above.)
**Lack of Analysis on Scaling Laws**: There is a noticeable absence of analysis regarding data scaling laws, which is important for understanding how the model's performance and efficiency scale as the data size increases. Such an analysis could provide insights into the model's behavior and guide future research directions.

Reviewer #2 Questions

2. Importance/Relevance: 3. Of sufficient interest

5. Originality/Novelty: 3. Moderately original; provides limited new insights or understanding

6. Justification of Originality/Novelty Score (required): Please see below.

7. Theoretical Development: 3. Probably correct; provides limited new insights or understanding

9. Experimental Validation: 3. Limited but convincing

11. Clarity of Presentation: 3. Clear enough

13. Reference to Prior Work: 3. References adequate

15. Overall evaluation of this paper: 3. Marginal accept

16. Justification of Overall evaluation of this paper (required): The paper proposes to extend audio-text models to longer audio, such as music, and longer text descriptions. Longer audio inputs are divided into smaller chunks via kernelization, and each chunk is combined with the text embeddings via attention. The proposed model achieves significantly higher results on audio-text retrieval tasks for long inputs. However, it is unclear whether attention is needed to combine the audio and text information or whether a simple dot product would be sufficient.
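Reviewer #2's question of whether attention is needed or a simple dot product suffices can be made concrete with a toy sketch. The following is a hypothetical NumPy illustration only (the function names `dot_product_score` and `attention_score` are invented for this example, not taken from the paper): when only a few chunks of a long track match the caption, text-conditioned attention pooling upweights the matching chunks, whereas mean-pooling followed by a single dot product dilutes them.

```python
import numpy as np

def dot_product_score(audio_chunks, text_emb):
    """Baseline: mean-pool the chunk embeddings, then take a single dot product."""
    pooled = audio_chunks.mean(axis=0)
    return float(pooled @ text_emb)

def attention_score(audio_chunks, text_emb):
    """Text-conditioned attention pooling: weight chunks by their affinity to the text."""
    logits = audio_chunks @ text_emb          # (T,) per-chunk affinity to the caption
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over chunks
    pooled = weights @ audio_chunks           # (d,) attention-weighted summary
    return float(pooled @ text_emb)

# Toy example: only the first of three chunks matches the text embedding.
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
print(dot_product_score(chunks, text))   # matching chunk is diluted by mean-pooling
print(attention_score(chunks, text))     # matching chunk is upweighted
```

In this toy setup the attention-based score exceeds the plain dot-product score; whether that gap persists on real data is exactly the ablation the reviewer asks for.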
There is a typo in Table 2 for HTSAT and RoBERTa: all the metrics are listed as R@5, but they should be R@5, R@20, and R@100, respectively.

Reviewer #3 Questions

2. Importance/Relevance: 3. Of sufficient interest

5. Originality/Novelty: 3. Moderately original; provides limited new insights or understanding

6. Justification of Originality/Novelty Score (required): The paper introduces a novel fusion mechanism that leverages kernel-wise and temporal attention and effectively scales contrastive learning to longer inputs. However, while these innovations are valuable, they build upon existing contrastive learning frameworks and established models, and the insights are incremental within the broader context of multimodal representation learning.

7. Theoretical Development: 3. Probably correct; provides limited new insights or understanding

9. Experimental Validation: 4. Theoretical paper: sufficient validation; Empirical paper: rigorous validation

11. Clarity of Presentation: 3. Clear enough

13. Reference to Prior Work: 3. References adequate

15. Overall evaluation of this paper: 3. Marginal accept

16. Justification of Overall evaluation of this paper (required): This paper presents a well-structured approach to handling long-form audio and text data through the CoLLAP model, effectively extending contrastive learning to capture temporal and multimodal correlations. The methodology is sound, and the experimental results demonstrate clear improvements over existing baselines across various datasets and retrieval tasks. The use of two language model variants and extensive testing on diverse tasks adds to the paper's empirical rigor. However, the novelty is moderate, as the approach mainly extends established methods rather than introducing groundbreaking theoretical advancements. Additionally, while the presentation is mostly clear, some complex sections could benefit from further clarification.
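The kernel-wise plus temporal attention fusion that the reviewers describe can be approximated in a short sketch. The following is a hypothetical NumPy rendering of the general idea only (the function `collap_style_score` and its fixed-size kernelization are assumptions made for illustration, not the authors' implementation): frame embeddings are grouped into kernels, each kernel is pooled with text-conditioned attention, and a second attention pass pools the kernel summaries over time.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def collap_style_score(frame_embs, text_emb, kernel_size=2):
    """Two-stage pooling: kernel-wise attention inside each chunk of frames,
    then temporal attention across the chunk summaries, conditioned on the text."""
    T, d = frame_embs.shape
    summaries = []
    for start in range(0, T, kernel_size):
        kernel = frame_embs[start:start + kernel_size]   # one chunk of frames
        w = softmax(kernel @ text_emb)                   # kernel-wise attention
        summaries.append(w @ kernel)
    summaries = np.stack(summaries)                      # (num_kernels, d)
    w_t = softmax(summaries @ text_emb)                  # temporal attention
    pooled = w_t @ summaries                             # (d,) track-level summary
    return float(pooled @ text_emb)

# Toy example: a 4-frame track where only the first frame matches the caption.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
print(collap_style_score(frames, text, kernel_size=2))
```

Both stages attend with respect to the text embedding, which is the sense in which the fusion is "multimodal"; in this toy case the two-stage score exceeds the 0.25 that plain mean-pooling would give.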