Comments to author (Associate Editor)
=====================================

The paper presents PDB-Eval, a novel dataset (and benchmark) for driver behavior descriptions based on multimodal large language models (MLLMs), and introduces comparative prompting to reduce hallucinations. It is well structured and demonstrates solid experimental results, making it a valuable contribution to autonomous driving research. However, I suggest the following improvements for a rebuttal: (1) a clearer statement of contributions and differentiation from existing datasets (also regarding the labeling process), (2) an expanded related work section, (3) additional explanations for key methodological choices, (4) clarification on dataset/code availability, and (5) improved presentation and citation consistency.

1. Clarity & Contributions: The paper lacks a clear statement of contributions and could better differentiate PDB-Eval from existing datasets (e.g., AIDE, BDD-X).
2. Related Work: The related work section is too short, particularly on MLLMs with multi-view inputs and behavior analysis in action-heavy settings.
3. Methodology & Explanation Gaps: Justifications for certain baseline choices (e.g., Brain4Cars, AIDE, and the MLLM methods) and metrics (why not BERT additionally for semantic assessment?) are missing, as is a discussion of comparative prompting biases.
4. Dataset & Code Release: It is unclear whether the dataset and pipeline code will be publicly released.
5. Presentation & Citations: Some tables are not cited in the text (e.g., Table 3), and some references lack consistency (e.g., abbreviations, missing page numbers). The overall discussion needs strengthening.

----------------------------------------
Comments on Video demonstration: [None found]

=====
Reviewer 1 of IV 2025 submission 401

Comments to the author
======================

You propose an approach to create a dataset for MLLM-based descriptions and explanations of driver behavior, and you contribute the dataset.
You propose a comparative prompting technique to identify specific driver behaviors while reducing hallucinations. You compare different MLLMs on this dataset and on other benchmarks. This is overall a nice paper, tackling a relevant topic in autonomous driving. The paper is generally well structured and well written, and the results look sound. There are many ablation studies with clear results that other researchers can compare against. Here are a few comments for improvement:

- In the introduction you mention that there is a domain gap on general visualization and explanation tasks for autonomous driving. However, there exists a lot of VQA research in this domain.
- The related work is a bit short. While you mention more references in the methods, this chapter could be extended, particularly in the direction of MLLMs with multi-view inputs, as multi-view scene descriptions do exist in the autonomous driving domain.
- The method is clearly structured and well described.
- Chapter "IV. PERSONALIZED DRIVING BEHAVIOR EVALUATION (PDB-EVAL)": a clearer title would signal that this section is about a dataset and not about results.
- Experiments overall: nice results and clear communication in the tables. The experiments and ablation studies are plausible and the interpretations are reasonable.
- Experiments "C: Visual Explanation QA (PDB-QA)": it is not clear which table is referenced here, or whether a table belongs to this chapter at all.
- Table III is not referenced in the text.
- "Comparing the performance of BLIP-2 and VTimeLLM, we can observe that VTimeLLM has a larger fine-tuning improvement compared to its zero-shot performance, which suggests that VTimeLLM has a higher performance upper-bound than BLIP-2." --> Where can I find that in your results?
- You mention limitations in your experiments, but an overall Discussion section summarizing your findings is missing.
- You do not mention whether you will publish your dataset, nor whether you will publish your code for the pipeline.

Overall, this is a well-written paper with interesting results. Figures and tables show clear quantitative and qualitative results.

=====
Reviewer 2 of IV 2025 submission 401

Comments to the author
======================

With the increasing involvement of MLLMs in the autonomous driving domain, the need for precise descriptions and explanations of vehicle movements becomes more critical. In this context, PDB-Eval, comprising the two main components PDB-X and PDB-QA, contributes to the field of driving behavior descriptions by providing a dataset that enables both the fine-tuning and evaluation of MLLMs. The dataset is thoroughly described, including a quantitative and qualitative analysis of its scope and level of detail, along with a step-by-step walkthrough of its creation pipeline. The paper is generally well written. However, there are some major shortcomings:

• A clear formulation of the paper's contributions is missing.
• The explanations for the selection of the baselines S-RNN (Brain4Cars) in Table V and AIDE in Table VI are missing, as are their corresponding subsections.

Concrete example:
• Inconsistency in reference 6, which includes the conference abbreviation, whereas all other references omit it.

=====
Reviewer 3 of IV 2025 submission 401

Comments to the author
======================

Driving behaviour understanding is a much-needed research topic and a crucial challenge in autonomous driving. I appreciate the effort that went into developing PDB-Eval. The proposed benchmark is novel and provides a structured approach to align LLMs and their reasoning with a human in the loop. The clear improvements made over Brain4Cars and AIDE are compelling, as is the focus on multi-view understanding. Comparative prompting, while effective, introduces potential biases that are not discussed.
An ablation study on different prompting strategies would help isolate the method's actual impact. Finally, temporal reasoning is flagged as a challenge but left unaddressed; more clarity on future directions here would be useful. Having said that, below are some specific comments:

- The motivation for PDB-Eval is clear, but a more explicit differentiation from existing datasets (e.g., AIDE, BDD-X) would strengthen the novelty claim.
- The related work section is too short and should be extended with in-context and out-of-context works that analyse human behaviour in monitored, action-heavy settings, since isolated driver behaviour assessment can be an intricate task by itself.
- The choice of comparing only two drivers' intentions per loop needs justification; comparing and aligning driving behaviour involves more essential factors than handling a similar scenario, so more clarity here would help.
- The citations also need cleanup: some references lack details, papers with more than two authors can be shortened to "et al.", and GPT-4V's use should be carefully framed to avoid unverified claims.

Overall, the paper is very impactful and technically relevant for the improvement of autonomous driving.