Reviewer #1 Questions

1. [Reviewer Guidelines] By taking this review assignment and checking on "I agree" below, I acknowledge that I have read and understood the reviewer guidelines (https://iccv2023.thecvf.com/reviewer.guidelines-362000-2-16-20.php).
Agreement accepted

2. [Large Language Model (LLM) Ethics] ICCV’23 does not allow the use of Large Language Models or online chatbots such as ChatGPT in any part of the reviewing process. There are two main reasons:
- Reviewers must provide comments that faithfully represent their original opinions on the papers being reviewed. It is unethical to resort to Large Language Models (e.g., an offline system) to automatically generate reviewing comments that do not originate from the reviewer’s own opinions.
- Online chatbots such as ChatGPT collect conversation history to improve their models. Therefore their use in any part of the reviewing process would violate the ICCV confidentiality policy (https://iccv2023.thecvf.com/ethics.for.reviewing.papers-362100-2-16-21.php).
Herewith I confirm that I have not used an online chatbot such as ChatGPT in preparing the review. This review reflects my own opinions, and no parts were generated by an automatic system.
I agree

4. [Summary] Describe the key ideas, experiments, and their significance (preferably in 5-7 sentences).
The authors propose a concise, natural-language description of images. In particular, the authors query a large language model for *categories* of descriptions. These categories are filtered post hoc and then used as a "report card" to represent images. The authors show that 1) this description can be used as a substitute for the image when classifying, and 2) it can be perturbed to fix misclassifications. The authors perform extensive experiments.

5. [Strengths] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these aspects of the paper are valuable. Short, unjustified review is NOT OK.
- Summarizing images into a succinct set of descriptors is interesting -- technically, an extreme compression method geared toward the downstream task. This is also perfectly sensible for classification, where we don't need *all* of the image's information.
- The authors' mechanisms for showing the effectiveness and utility of these descriptors are sound. I wonder if there's a quantitative way to show this is "better" than saliency maps or previous explainability approaches. Namely, you can't correct a saliency map the way you can correct these descriptors (at least, not in as straightforward a way).
- The authors evaluate on a number of datasets (CUB, CIFAR, Flower, Food, etc.); a related comment is in the weaknesses section. The authors furthermore ablate and compare against sensible, obvious baselines for obtaining visual descriptors. This is very helpful.
- In summary, the concept of visual natural-language descriptors seems novel to me, as does the mechanism for obtaining them. Evaluation is fairly thorough as well.
- Figures are well done. Thank you to the authors.

6. [Weaknesses] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these are weak aspects of the paper. Short, unjustified review is NOT OK.
For example: a comment saying that this has been done before must be accompanied by specific reference(s) and an explanation of how they overlap; a comment saying that the experiments are insufficient to validate the claims should be accompanied by explanations of what exactly is missing; not achieving state-of-the-art, or not surpassing prior methods, is not sufficiently a weakness, unless this comment is justified; a comment of “lack of novelty” should be carefully justified, and novelty could involve new insights/understandings or new discoveries/observations, not just new technical methods.
- ImageNet is notably missing. Do you have an idea of how visual descriptors would "perform" on ImageNet? Maybe even TinyImageNet. I'm not actually asking you to train a classification model on ImageNet, as I know the computational cost may make that impractical. I list a curiosity below in "additional comments" (A) that may let you "evaluate" your method on ImageNet (and smaller datasets) much more quickly. Basically, construct an oracle instead of training a classifier.

7. [Paper rating]
2. Weak accept

8. [Recommendation confidence]
Not confident: I do not work on this topic, and I am not confident on my evaluation.

9. [Justification of rating] Provide detailed justification of your rating. It should involve how you weigh the strengths and weaknesses of the paper.
The problem and the solution seem novel and interesting. I'm mainly looking for some sort of analysis on ImageNet. It doesn't need to be a fully trained model.

10. [Additional comments] Minor suggestions, questions, corrections, etc. that can help the authors improve the paper, if any.
(A) To show the effectiveness of the final set of descriptors, I wonder if there's a way to construct an oracle that shows "in theory, this set of descriptors allows us to achieve 100% classification accuracy". One way would be to check: "Are there any cases where two samples from different classes get *exactly* the same descriptors?" I know the descriptors are soft probabilities, so this may not be achievable exactly, but maybe some variation of this? In fact, you might be able to construct an (albeit silly) super-deep decision tree in this way, based on your descriptors. I'm not looking for *even more* experiments, but I'm curious whether there are known (or conceivable) metrics to this effect. This could help you scale to larger datasets (and make an even more impressive experiments section) without a whole ton of training. (A toy sketch of such a collision check is given after these comments.)
(B) I imagine this wouldn't generalize (out of the box) to, say, object detection or segmentation -- effectively, any task that requires localization. This is obviously only tangentially related, but I'm wondering whether the authors have thoughts on how this could extend to localization tasks.
(C) Certainly not for this paper, like my curiosities above, but I wonder if you could use these visual descriptors to "distill" a model into a super-fast classifier -- something silly like a 5-layer MLP that runs 100x faster than a ResNet.
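To make (A) concrete, here is a toy sketch of the collision check I have in mind. It assumes the per-image descriptors are available as a plain matrix of soft scores; the variable names, shape convention, and tolerance are all hypothetical and not taken from the paper.

```python
import numpy as np

def descriptor_collision_rate(descriptors: np.ndarray, labels: np.ndarray, tol: float = 1e-3) -> float:
    """Fraction of cross-class sample pairs whose descriptor vectors are
    (near-)identical under an L-infinity tolerance.

    descriptors: (N, K) soft descriptor scores per image (assumed format).
    labels:      (N,)   integer class labels.
    A rate of 0.0 means the descriptor set can, in principle, separate every
    pair of samples from different classes, i.e. an oracle classifier exists.
    """
    n = len(labels)
    collisions, cross_pairs = 0, 0
    for i in range(n):
        # L-infinity distance from sample i to every later sample
        diff = np.abs(descriptors[i + 1:] - descriptors[i]).max(axis=1)
        other_class = labels[i + 1:] != labels[i]
        cross_pairs += other_class.sum()
        collisions += np.logical_and(other_class, diff < tol).sum()
    return collisions / max(cross_pairs, 1)

# Interface demo on random data (real descriptors would come from the method):
rng = np.random.default_rng(0)
rate = descriptor_collision_rate(rng.random((500, 32)), rng.integers(0, 10, 500))
print(f"cross-class collision rate: {rate:.4f}")
```

The "silly decision tree" variant would amount to fitting an unbounded-depth tree (e.g., sklearn's DecisionTreeClassifier) on the same matrix and reporting its training accuracy as the oracle number.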
11. [Dataset contributions] Does a paper claim a dataset release as one of its scientific contributions?
No

12. [Post-Rebuttal Recommendation] Give your final rating for this paper. Don't worry about poster vs oral. Consider the input from all reviewers, the authors' feedback, and any discussion. (Will be visible to authors after author notification)
2. Weak Accept

13. [Post-Rebuttal Justification] Justify your post-rebuttal assessment. Acknowledge any rebuttal and be specific about the final factors for and against acceptance that matter to you. (Will be visible to authors after author notification)
Thanks to the authors for the responses. I should have clarified that I had seen ImageNet-Animals and was looking for a run on the full set of ImageNet classes. In any case, it wasn't a major concern, so I retain my rating.

Reviewer #2 Questions

1. [Reviewer Guidelines] By taking this review assignment and checking on "I agree" below, I acknowledge that I have read and understood the reviewer guidelines (https://iccv2023.thecvf.com/reviewer.guidelines-362000-2-16-20.php).
Agreement accepted

2. [Large Language Model (LLM) Ethics] ICCV’23 does not allow the use of Large Language Models or online chatbots such as ChatGPT in any part of the reviewing process. There are two main reasons:
- Reviewers must provide comments that faithfully represent their original opinions on the papers being reviewed. It is unethical to resort to Large Language Models (e.g., an offline system) to automatically generate reviewing comments that do not originate from the reviewer’s own opinions.
- Online chatbots such as ChatGPT collect conversation history to improve their models. Therefore their use in any part of the reviewing process would violate the ICCV confidentiality policy (https://iccv2023.thecvf.com/ethics.for.reviewing.papers-362100-2-16-21.php).
Herewith I confirm that I have not used an online chatbot such as ChatGPT in preparing the review. This review reflects my own opinions, and no parts were generated by an automatic system.
I agree

4. [Summary] Describe the key ideas, experiments, and their significance (preferably in 5-7 sentences).
This paper claims to propose a new paradigm for visual recognition by learning a concise set of text attributes. These attributes are found in three steps: 1) a large language model (LLM) is queried and a large set of text attributes is generated; 2) the cosine similarity between each image feature and each text attribute feature is computed; 3) the attributes are pruned to a smaller set by a learning-to-search method. At test time, the cosine similarity between the image feature of a test image and each attribute in the final set is computed, and linear probing is performed for the final classification (see the sketch after the strengths below). The authors conducted experiments on datasets across several domains to evaluate the text attributes selected by their method. The authors demonstrate superior performance compared to previous works with the same number of text attributes.

5. [Strengths] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these aspects of the paper are valuable. Short, unjustified review is NOT OK.
* The idea of tracing how a vision-language model classifies an image back to a few concepts is interesting. However, the setting/benchmark is unclear (see below).
* The attributes generated by the method are much more concise than those of previous methods (e.g., 32 attributes for distinguishing 200 species in CUB).
* The authors also demonstrate that the attributes generated by the proposed method form a better basis than the attributes generated by the previous work LaBo on CUB and CIFAR-100.
* Ablations are performed to demonstrate that their learning-to-search method is better than other baseline methods for selecting the attributes, such as K-Means and uniform sampling.
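To make the summarized pipeline concrete, here is a minimal sketch of the test-time procedure as I understand it: attribute scores are cosine similarities between image embeddings and the selected attributes' text embeddings, and a linear probe classifies on top of those scores. The embeddings below are random stand-ins for CLIP features, and all names and shapes are assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def attribute_scores(image_feats: np.ndarray, attr_feats: np.ndarray) -> np.ndarray:
    """Cosine similarity between each image and each selected text attribute.

    image_feats: (N, D) image embeddings from a vision-language model (e.g., CLIP).
    attr_feats:  (K, D) text embeddings of the K selected attributes.
    Returns an (N, K) score matrix used as the image representation.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    att = attr_feats / np.linalg.norm(attr_feats, axis=1, keepdims=True)
    return img @ att.T

# Random stand-ins for precomputed CLIP embeddings (hypothetical shapes).
rng = np.random.default_rng(0)
train_img, test_img = rng.normal(size=(800, 512)), rng.normal(size=(200, 512))
attrs = rng.normal(size=(32, 512))                       # K = 32 selected attributes
y_train, y_test = rng.integers(0, 10, 800), rng.integers(0, 10, 200)

# Linear probe on the K-dimensional attribute-score representation.
probe = LogisticRegression(max_iter=1000).fit(attribute_scores(train_img, attrs), y_train)
acc = probe.score(attribute_scores(test_img, attrs), y_test)
print(f"linear-probe accuracy on attribute scores: {acc:.3f}")
```

With random embeddings this of course gives chance-level accuracy; the point is only the shape of the pipeline.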
6. [Weaknesses] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these are weak aspects of the paper. Short, unjustified review is NOT OK. For example: a comment saying that this has been done before must be accompanied by specific reference(s) and an explanation of how they overlap; a comment saying that the experiments are insufficient to validate the claims should be accompanied by explanations of what exactly is missing; not achieving state-of-the-art, or not surpassing prior methods, is not sufficiently a weakness, unless this comment is justified; a comment of “lack of novelty” should be carefully justified, and novelty could involve new insights/understandings or new discoveries/observations, not just new technical methods.
* In the abstract, the authors describe the core focus of their work as "to discover a concise set of attributes that are discriminative to achieve strong recognition accuracy". If strong recognition accuracy is the key goal of the method, the work lacks a few important baselines (see below).
* The setting is unclear: since labels are used in training and the number of training samples is not limited to a small number (i.e., it is not few-shot), the authors appear to focus on a supervised learning setting, but this is unclear because the comparison does not involve supervised learning baselines.
* The authors use CLIP models as their vision-language model. However, there is no comparison to the zero-shot setting of CLIP, which directly uses the class names.
* The authors also mention the additional advantage of high interpretability. However, this ability is insufficiently defined and justified. The authors propose the importance score (IS) as a justification for interpretability in Sec. 3.4 (1), but this importance score is insufficiently evaluated: it is only computed for two examples in Fig. 5, without comparisons.
* The comparison with LaBo is only done on two datasets (CUB, CIFAR-100). Comparisons with previous methods on other datasets are not presented.
* The writing of several parts of the paper is unclear. For example, Fig. 4 is unclear: (1) it is unclear which dataset is used to train the models shown in Fig. 4; (2) it is unclear what the "similar words" are. The examples "red, gray, snow wings" are insufficient to explain how these similar words are collected and selected.
* There are a few grammar mistakes that hinder the understanding of the work. For example, L336: "But ..." -> "However, ..."; L752: "in an automatic" -> "automatically".

7. [Paper rating]
4. Weak reject

8. [Recommendation confidence]
Somewhat confident: I do not directly work on this topic, but my expertise and experience are sufficient to evaluate this paper.

9. [Justification of rating] Provide detailed justification of your rating. It should involve how you weigh the strengths and weaknesses of the paper.
The paper lacks clarity in the core setting and in the justification of key concepts, and several baselines are omitted. The authors are encouraged to provide these baselines for a fair comparison.

10. [Additional comments] Minor suggestions, questions, corrections, etc. that can help the authors improve the paper, if any.
Minor detail: the paper (PDF file) does not have the paper ID (6433) filled in. However, this doesn't affect the rating.

11. [Dataset contributions] Does a paper claim a dataset release as one of its scientific contributions?
No
12. [Post-Rebuttal Recommendation] Give your final rating for this paper. Don't worry about poster vs oral. Consider the input from all reviewers, the authors' feedback, and any discussion. (Will be visible to authors after author notification)
4. Borderline Reject

13. [Post-Rebuttal Justification] Justify your post-rebuttal assessment. Acknowledge any rebuttal and be specific about the final factors for and against acceptance that matter to you. (Will be visible to authors after author notification)
The authors partially addressed my questions. However, several parts of the work remain unclear. In Q1, the authors mention that their goal is to improve over the baselines. However, in Table 1 of the rebuttal, the proposed method achieves lower accuracy than the plain supervised learning baseline (CLIP-Train Visual, Q2). The authors' response to this is that the work aims to classify images with attributes to gain interpretability (response to Q3). However, the definition of interpretability remains vague, as the authors admit in the response to Q4 that the method offers only "a level of interpretability". Thanks for pointing out the additional settings in the appendix (Q5). Therefore, I raised my score to borderline reject. The authors are encouraged to further improve the clarity of the work, including the setting, definitions, and writing, as stated in my questions (Q6/7).

Reviewer #3 Questions

1. [Reviewer Guidelines] By taking this review assignment and checking on "I agree" below, I acknowledge that I have read and understood the reviewer guidelines (https://iccv2023.thecvf.com/reviewer.guidelines-362000-2-16-20.php).
Agreement accepted

2. [Large Language Model (LLM) Ethics] ICCV’23 does not allow the use of Large Language Models or online chatbots such as ChatGPT in any part of the reviewing process. There are two main reasons:
- Reviewers must provide comments that faithfully represent their original opinions on the papers being reviewed. It is unethical to resort to Large Language Models (e.g., an offline system) to automatically generate reviewing comments that do not originate from the reviewer’s own opinions.
- Online chatbots such as ChatGPT collect conversation history to improve their models. Therefore their use in any part of the reviewing process would violate the ICCV confidentiality policy (https://iccv2023.thecvf.com/ethics.for.reviewing.papers-362100-2-16-21.php).
Herewith I confirm that I have not used an online chatbot such as ChatGPT in preparing the review. This review reflects my own opinions, and no parts were generated by an automatic system.
I agree

4. [Summary] Describe the key ideas, experiments, and their significance (preferably in 5-7 sentences).
This paper studies the problem of visual attribute discovery. The idea is to query strong knowledge bases -- LLMs -- for thousands of visual attributes that distinguish different object categories. However, evidence shows that these attributes are noisy: using a huge number of randomly selected attributes achieves performance similar to the ones generated by LLMs. To address the issue, this paper proposes a learning-to-search approach for filtering the noise embedded in the LLM's attributes. The idea is to optimize a dictionary for each dataset and find the best-matched attributes from the pool. The proposed method outperforms baselines on 8 benchmark classification datasets.
5. [Strengths] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these aspects of the paper are valuable. Short, unjustified review is NOT OK.
1. I like the idea of mining fine-grained visual concepts from existing large language models. LLMs have been shown to encode enormous common-sense knowledge; for example, they have been applied to solve diamond mining in Minecraft [1]. In computer vision, fine-grained recognition is still an open problem, where annotations are too expensive to obtain. Hence, leveraging the knowledge encoded in LLMs is an interesting exploration for solving fine-grained visual recognition.
2. The idea of optimizing a feature dictionary and then finding the best-matched attributes from the pool is neat. The paper also shows that the regularization and objective are effective and outperform baselines by a large margin.
3. Compared to human-annotated attributes (e.g., the ones in CelebA/CUB), the attributes generated by LLMs are more complex and higher-order. They are not merely single properties of objects (e.g., colors, parts, ...) but combinations of different states. This kind of higher-order integration would be helpful for fine-grained recognition, though the information is much noisier.
4. The paper is well written. It is easy to follow and to understand the motivation.
[1] Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. Wang et al.

6. [Weaknesses] Consider the aspects of key ideas, experimental or theoretical validation, writing quality, and data contribution (if relevant). Explain clearly why these are weak aspects of the paper. Short, unjustified review is NOT OK. For example: a comment saying that this has been done before must be accompanied by specific reference(s) and an explanation of how they overlap; a comment saying that the experiments are insufficient to validate the claims should be accompanied by explanations of what exactly is missing; not achieving state-of-the-art, or not surpassing prior methods, is not sufficiently a weakness, unless this comment is justified; a comment of “lack of novelty” should be carefully justified, and novelty could involve new insights/understandings or new discoveries/observations, not just new technical methods.
1. The efficacy of the attributes is not fully demonstrated on these benchmarks. As shown in Figure 3, naive image features easily outperform attributes. When demonstrated only on classification tasks, using attributes is less efficient and less optimal than using image features, especially since image features do not need additional processing. However, I believe that attributes enable a detailed and multi-dimensional recognition of an input image. It would be more convincing to show cases where attributes are better than image features. I can think of the following tasks: 1) zero-shot recognition, and 2) out-of-distribution recognition.
* For 1), the question is: can we recognize novel categories (e.g., centaur) using the attributes derived from ImageNet?
* For 2), can you showcase that attribute-based recognition performs better than discrete classifiers on ImageNet-A?
2. How does the performance compare to "Visual Classification via Description from Large Language Models" (Menon and Vondrick)?
3. What do you mean by "we use a classification head to project the dictionary into KC classes" in Line 325? I guess you project E (a K*D matrix) into a (C*D) classifier layer? Also, how do you derive p_{i,c}? (A toy sketch of my guess is given after this list.)
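To make my guess in weakness 3 concrete, here is a toy sketch of one possible reading: a learned linear head mixes the K entries of the dictionary E (K*D) into C class prototypes, and p_{i,c} is the softmax over the resulting class scores for image i. This is purely my speculation about the formulation; all shapes and names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

K, D, C = 32, 512, 200  # number of dictionary entries, embedding dim, classes (assumed)

class DictionaryHead(nn.Module):
    """One possible reading of 'project the dictionary into K*C classes':
    a linear head mixes the K x D dictionary E into C class prototypes, and an
    image embedding is scored against those prototypes."""
    def __init__(self, K: int, D: int, C: int):
        super().__init__()
        self.E = nn.Parameter(torch.randn(K, D))     # learned dictionary
        self.mix = nn.Parameter(torch.randn(C, K))   # the 'classification head'

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.mix @ self.E          # (C, K) @ (K, D) -> (C, D) class prototypes
        logits = x @ W.t()             # (N, D) @ (D, C) -> (N, C)
        return logits.softmax(dim=-1)  # rows are p_{i, c}

p = DictionaryHead(K, D, C)(torch.randn(4, D))
print(p.shape, p.sum(dim=-1))          # torch.Size([4, 200]), each row sums to 1
```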
7. [Paper rating]
2. Weak accept

8. [Recommendation confidence]
Somewhat confident: I do not directly work on this topic, but my expertise and experience are sufficient to evaluate this paper.

9. [Justification of rating] Provide detailed justification of your rating. It should involve how you weigh the strengths and weaknesses of the paper.
I like the idea of distilling visual information from LLMs. The proposed learning-to-search method is neat. The paper is well written. The results are strong. However, it would be more convincing if the authors could demonstrate the benefit of using attributes over discrete categories for visual recognition.

10. [Additional comments] Minor suggestions, questions, corrections, etc. that can help the authors improve the paper, if any.
It would be very interesting to see the out-of-distribution or zero-shot classification results.

11. [Dataset contributions] Does a paper claim a dataset release as one of its scientific contributions?
No

12. [Post-Rebuttal Recommendation] Give your final rating for this paper. Don't worry about poster vs oral. Consider the input from all reviewers, the authors' feedback, and any discussion. (Will be visible to authors after author notification)
2. Weak Accept

13. [Post-Rebuttal Justification] Justify your post-rebuttal assessment. Acknowledge any rebuttal and be specific about the final factors for and against acceptance that matter to you. (Will be visible to authors after author notification)
The rebuttal addresses my concerns. I suggest accepting this paper.