Paper Decision

Decision: Accept (Poster)

Comment: The authors propose a simple but effective method for zero-shot classification that leverages hierarchical label information. The paper was reviewed by three reviewers, all recommending acceptance: 1 x Accept and 2 x Weak Accept. The reviewers raised a number of questions about the approach and offered some suggestions for analysis. The authors provided adequate responses, and the reviewers ultimately appear content with them. The Area Chair (AC) has read the reviews and the author responses, and has looked at the paper itself. The AC concurs with the reviewers that the approach is simple, interesting, and effective, and warrants publication in ICML.

Rebuttal by Authors

We sincerely thank all the reviewers for their detailed and thoughtful feedback. We are glad to see all the reviewers recommending acceptance, with reviewers highlighting the clarity of writing (mdQe, B5kc), the thoroughness of the experimental design and ablations (XemL, B5kc, mdQe), the simple yet effective proposed method (XemL, mdQe), and the depth of the literature review (B5kc). Following the constructive feedback of Reviewers mdQe and B5kc, we have run additional experiments that we believe strengthen our paper. We summarize one of the key experiments below.

Calibration Experiment. To investigate properties of the subclass probability distribution, we calculated the Expected Calibration Error (ECE). We observe that CHiLS consistently improves the calibration of the zero-shot CLIP prediction, often by 1--2 orders of magnitude. We include tabular results in the individual responses below.

Additionally, we investigated how our method, CHiLS, works for other open-vocabulary vision-language models (namely, FLAVA) in zero-shot image classification, and found similar improvements in zero-shot accuracy when using FLAVA, highlighting that our results are not endemic to CLIP alone.

Official Review of Submission2029 by Reviewer mdQe

Summary: The authors propose a simple method that leverages hierarchical label information for the task of zero-shot image classification. For each class, the key idea is to obtain predictions on a set of subclasses via a standard zero-shot method, and then aggregate this information to obtain a prediction for the original superclass. Through extensive experiments on a variety of datasets, the authors show improved performance on the zero-shot classification problem, both when the subclass hierarchy is present in the dataset and when it is absent.

Strengths And Weaknesses:

Strengths
- The paper is well written and easy to follow.
- The motivation behind the idea is sound, and the proposed method is simple, easy to implement, and requires no additional training (apart from a CLIP model that is readily available) to perform well on the zero-shot classification task.
- The experimental evaluation is extensive and done on a variety of datasets with significantly different object types, highlighting the generalization of the proposed method to different distributions.
- I like the use of GPT-3 to extend the proposed method to situations where the subclass hierarchy isn't available. Although the improvements aren't as significant, this greatly increases the applicability of the proposed method.
- I appreciate the authors providing code (along with instructions) to verify the results.

Weaknesses

The authors show extensive experiments that highlight that the proposed approach performs favourably when looking at the top predicted class for each image (argmax in Algorithm 1).
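As a reference for the discussion below, here is a rough, hypothetical sketch of the subclass-to-superclass aggregation as it is described in this thread (softmax over all sub-categories, reweighting by the superclass probability, then an argmax mapped back to the parent class). Names and signatures are illustrative only, not the authors' released implementation.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def chils_style_predict(super_logits, sub_logits, sub_to_super):
    """super_logits: (S,) image-text scores for the S superclass prompts.
    sub_logits:   (M,) image-text scores for all M subclass prompts (pooled across superclasses).
    sub_to_super: (M,) index of each subclass's parent superclass."""
    sub_to_super = np.asarray(sub_to_super)
    p_super = softmax(np.asarray(super_logits))   # superclass probabilities
    p_sub = softmax(np.asarray(sub_logits))       # softmax over all sub-categories
    reweighted = p_sub * p_super[sub_to_super]    # reweight each subclass by its parent's probability
    best_sub = int(np.argmax(reweighted))         # argmax over reweighted subclass scores
    return int(sub_to_super[best_sub])            # report the winning subclass's parent superclass

# Toy usage: 2 superclasses with 3 and 2 hypothetical subclasses.
pred = chils_style_predict([2.0, 1.0], [1.5, 0.3, 0.2, 1.1, 0.4], [0, 0, 0, 1, 1])
```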
However, I am curious what the distribution over the class probabilities looks like after using the proposed method (when compared to the baseline). My intuition here is that, since the subclass probability is obtained by taking a softmax over all sub-categories and is then additionally multiplied by the superclass probability (Line 7 in Algorithm 1), the resulting classifier confidence might be low, and the class probabilities across super-categories might be closer together. This matters in situations where one might need to tune the classifier for high precision / recall, which is often done by setting a threshold over the prediction confidences. If the confidence values are low and close together, setting such a threshold might be challenging. Building on this, does the top-K (K=3, 5) accuracy show similar improvements when compared with the baseline?

How crucial is the quality of the generated sub-categories to the performance? Table 8 highlights that in certain situations, having a single sub-category (m=1) alongside the super-category labels leads to improved performance. I am curious whether this performance gain is similar for an arbitrary choice of the sub-category. That is, would the performance gains hold if I arbitrarily choose between {huskies, corgis, labradors, or any other dog sub-category} in the m=1 set? Or do some sub-category choices have a higher impact?

Extending the previous argument, would two sub-category sets of the same size containing different elements provide similar improvements? For example, considering the category dog, if set 1 is {huskies, corgis, labradors} and set 2 is {german shepherd, bulldog, poodle}, would using set 1 or set 2 be almost equivalent? If it isn't, how should one go about selecting an optimal sub-category set?

The authors use GPT-3 to obtain sub-categories in situations where hierarchical information is absent. Can these sub-categories be obtained via a simple lookup in existing databases like WordNet?

Questions: I like the general premise of the paper and do not have any major concerns. The thorough experimental evaluation, alongside extensive ablations, adequately highlights the effectiveness of the proposed method. Although some of the ideas used in this work have been explored previously in the hierarchical classification domain, the application to zero-shot learning is unique. I do have some questions, detailed in the Weaknesses portion of the review. Specifically: Although the top-1 performance is good, what do the top-k performance and the probability distribution over super-categories look like when compared with the baselines? How sensitive is the model to different subsets of size m? That is, can two sub-category sets of the same size, but containing different elements, provide similar improvements? It would be helpful if the authors could provide some analysis to answer the above questions.

Limitations: The authors have adequately addressed the limitations of their work. They correctly highlight that their approach is applicable in scenarios where a hierarchy over classes exists, and additionally discuss looking into a more theoretical analysis of the empirically observed gains.

Ethics Flag: No
Soundness: 3 good
Presentation: 3 good
Contribution: 3 good
Rating: 7: Accept: Technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one area, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

Rebuttal by Authors (Part 1)

The authors thank the reviewer for the detailed feedback, and are happy to hear their positive view on the paper as a whole.

"The authors show extensive experiments … However, I am curious how the distribution over the class probabilities looks after using the proposed method (when compared to the baseline). … This matters in situations where one might need to tune the classifier for high precision / recall, which is often done by setting a threshold over the prediction confidences. … Building on this, does the Top-K (K=3, 5) accuracy have similar improvements when compared with the baseline?"

We thank the reviewer for this insightful comment. As per your suggestion, we have conducted two experiments to investigate the questions brought up here.

To investigate the properties of the subclass probability distribution, we calculated the Expected Calibration Error (ECE), with 50 bins and equal-mass binning. For a fair comparison against the baseline, we take the max value (that is, subclass probability x superclass probability) from each superclass and renormalize, giving our estimated superclass probabilities for CHiLS. Below, we show that CHiLS consistently improves the calibration of the zero-shot CLIP prediction, often by 1--2 orders of magnitude. This also highlights an important result: even in cases where top-1 accuracy decreases when using CHiLS (e.g., Food-101), CHiLS significantly improves the calibration of the model. We report ECE numbers (multiplied by 100):

| Dataset | Superclass | CHiLS (True Map) | CHiLS (GPT Map) |
|---|---|---|---|
| CIFAR20 | 1.06 | 0.07 | 0.33 |
| living17 | 1.68 | 0.03 | 0.05 |
| RESISC45 | 1.40 | N/A | 0.27 |
| Food-101 | 1.84 | N/A | 0.05 |
| ObjectNet | 0.83 | 0.17 | 0.60 |

As the reviewer noted, these results suggest that CHiLS would also be beneficial in situations where we are interested in precision/recall-based metrics.

Rebuttal by Authors (Part 2)

We produced the top-3 and top-5 results for our method. In the case of CHiLS, we compute what we call "top-R" accuracy: we use the top-R most probable subclasses such that they correspond to K (i.e., 3 or 5) distinct superclasses, in order to make our method directly comparable to the baseline. Below we see that, in general, effect sizes tend to decrease as K increases, with the improvement gap shrinking and CHiLS sometimes even becoming slightly worse than the baseline at higher K.

| Dataset | Superclass (top-1/3/5) | CHiLS (True Map, top-1/3/5) | CHiLS (GPT Map, top-1/3/5) |
|---|---|---|---|
| CIFAR20 | 59.57 / 83.54 / 92.31 | 85.28 / 96.03 / 98.05 | 65.91 / 85.29 / 92.81 |
| living17 | 94.88 / 99.71 / 99.82 | 97.38 / 99.73 / 99.94 | 99.24 / 99.73 / 99.97 |
| RESISC45 | 72.56 / 91.79 / 95.79 | N/A | 72.75 / 90.71 / 95.16 |
| Food-101 | 93.87 / 98.90 / 99.50 | N/A | 93.80 / 98.84 / 99.46 |
| ObjectNet | 53.12 / 82.62 / 91.87 | 85.36 / 96.06 / 98.40 | 53.53 / 82.19 / 91.85 |

Note that we have shown results only on a randomly picked subset of datasets. We have included all of the results in the updated draft of the paper and would be happy to provide the full results here if the reviewer requests them.

"The authors use GPT-3 to obtain sub-categories in situations where hierarchical information is absent. Can these sub-categories be obtained via a simple lookup in existing databases like WordNet?"
We agree that using WordNet could be a simple alternative. In fact, in our initial experiments we explored using WordNet as our primary mechanism for extracting subclasses. However, we ran into a number of issues. Namely, WordNet only seemed to work reasonably on the BREEDS datasets (which are already based on WordNet through the ImageNet hierarchy); on other datasets we saw that a) class names at similar semantic levels were not at similar depths in WordNet (making it unclear where we should draw subclasses from), and b) many classes simply did not exist directly in the WordNet hierarchy. We have included a discussion of this in the updated draft.

Rebuttal by Authors (Part 3)

"How crucial is the quality of the generated sub-categories to the performance? … I am curious if this performance gain is similar for an arbitrary choice of the sub-category. … That is, can two sub-category sets of the same size, but containing different elements, provide similar improvements?"

Given the reviewer's question on the sensitivity to the contents of a label set (of fixed size), we performed two additional experiments: (i) randomly sampling fixed-size GPT subsets; (ii) replacing half of the subclasses in the true hierarchical map with GPT-generated subclasses. For the former experiment, we randomly generate label sets of fixed size (i.e., 5 or 10) and track the variance in the output accuracy of our method. In summary, we find that the estimated variance due to different label sets of the same size is quite small, around 1e-5. This experiment demonstrates that the particular random choice of GPT-generated subclasses has little impact on CHiLS performance. It also suggests that much of the benefit of CHiLS may come from giving the underlying CLIP model more options to decide from that are related to a given class.

However, as noted above, in this experiment we randomly selected GPT-generated sub-classes. Intuitively, if a specific subclass is a good representative of the superclass (or is a frequent subclass in the dataset), then one would expect the benefit of using that subclass name to outweigh the benefit of using an alternate subclass that is less representative (or less frequent). To check this, we performed a second experiment in which we looked at the performance of CHiLS on CIFAR20 when using a mix of the GPT-generated and true subsets: this setting achieves an accuracy of 73.6%, roughly between the all-GPT case (65.9%) and the all-true case (85.3%). Thus, the use of subclasses that are present in the data, as opposed to ones that aren't, does have a noticeable impact on performance.

In future work, we hope to perform experiments where we allow variable subclass set lengths in the GPT setting as well. Our initial idea in this direction is to explore chain-of-thought prompting with large models, where we first query a GPT-like model for the length of the subclass set and then query again to obtain a subset of that specific length.

Response to Rebuttal (Official Comment by Reviewer mdQe, 26 Mar 2023)

Comment: I would like to thank the authors for the detailed rebuttal. Most of my concerns have been adequately addressed, and I hope the authors include all the additional experiments presented here in the revised version of the paper as well. I will improve my score to an accept.
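As context for the WordNet discussion in this thread, a minimal hyponym lookup of the kind the authors describe attempting might look like the sketch below (using NLTK's WordNet interface; this is illustrative only, not the authors' code).

```python
# Requires: pip install nltk, then a one-time nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def wordnet_subclasses(class_name, max_names=10):
    synsets = wn.synsets(class_name, pos=wn.NOUN)
    if not synsets:
        return []  # many dataset class names are simply absent from WordNet
    # Ambiguity: the first synset may not be the intended sense, and the depth
    # at which useful hyponyms sit varies from class to class.
    hyponyms = synsets[0].hyponyms()
    names = [lemma.replace("_", " ") for h in hyponyms for lemma in h.lemma_names()]
    return names[:max_names]

print(wordnet_subclasses("dog"))  # e.g. a list such as ['puppy', 'pooch', ...], depending on the WordNet version
```

The two failure modes the authors mention show up directly in such a lookup: some class names return no synsets at all, and when they do, hyponyms of comparable semantic granularity are not guaranteed to sit at a consistent depth.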
Official Review of Submission2029 by Reviewer XemL (11 Mar 2023, modified 14 Mar 2023)

Summary: In this work, the authors propose to adopt a hierarchical label set and add mappings at the input and output of the network, thereby enriching the information on the language side and increasing performance. The work is very clever and interesting. Experimental results also show that the method is effective.

Strengths And Weaknesses:
Strengths: The idea of the work is very clever and interesting. Experimental results show that the method is effective. The ablation analysis is relatively complete.
Weaknesses: This work was not compared with other methods in the experiments.

Questions: See Weaknesses.
Limitations: This work has no potential negative social impact. The authors also discuss the limitations of their work.
Ethics Flag: No
Ethics Review Area: I don't know
Soundness: 3 good
Presentation: 3 good
Contribution: 3 good
Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Code Of Conduct: Yes

Rebuttal by Authors

We thank the reviewer for their response and their positive feedback on our work.

"This work was not compared with other methods in the experiment."

To the best of our knowledge, no existing paper leverages label hierarchy information for zero-shot image classification in a way that would make sense as a fair comparison (we include a detailed literature survey of related methods in Section 2). To compensate, we do attempt to provide comparisons of our method by ablating the importance of the reweighting step (Section 5.4, part 1), the choice of reweighting method (Section 5.4, part 2), and the choice of subclass and prompt aggregation methods (Section 5.4, part 7). If you have any additional suggestions for experiments, we would be happy to investigate them.

Official Review of Submission2029 by Reviewer B5kc

Summary: The paper presents a method for improving zero-shot image classification, especially for coarse class labels. In particular, the evaluation focuses on the CLIP model, creating prompts (similar to the original CLIP paper) and performing vision-language scoring through cosine similarity. Sub-labels are identified for each coarse label (e.g., corgi for the label dog), matching is performed at the sub-label level, and results are inversely mapped back to the super-classes (coarse labels) through a simple re-weighting strategy. Evaluation shows that the proposed approach works very well (10%, 15%, even 30% improvement) when the hierarchy is known. When the hierarchy is not known, the authors present a neat idea to prompt GPT-3 to produce a set of sub-labels; unfortunately, performance improvements are modest here.

Strengths And Weaknesses:

Strengths:
- Very thorough related work that positions this paper well, in terms of addressing the label hierarchy instead of prompting techniques, adapters, or other modifications that previous works have proposed.
- Simple, but good idea to capture the hierarchy (especially sub-labels) that results in good performance improvements.
- Good and thorough evaluation. Results are shown on 15+ datasets in total. Ablations show the impact of known vs. unknown hierarchy, the impact of re-weighting, and different algorithms for re-weighting.
- Well written paper. The supplementary material presents a lot of details and insights into the datasets, hierarchies, and detailed improvements across various parameters (e.g., Table 8 for the number of GPT-3 sub-classes).

Weaknesses:
1. [Minor] My reaction (expectation?) from the title was that the super-classes of labels (e.g., ImageNet) would be used somehow. However, this paper considers sub-classes part of the hierarchy. I wonder if this can be captured in the title somehow?
2. Some assumptions are being made. The Limitations do describe them, with the main one being that such a method may not be applicable for fine-grained classification and that meaningful sub-labels should exist. Idea/suggestion for future work: I wonder if we can create sub-label prompts of fine-grained classes too by saying things like "corgi standing on all 4 legs visible from the side; or corgi facing the camera; ..."? Would such prompts be beneficial, since CLIP is potentially more likely to match such a prompt rather than just "an image of a corgi"? Basically, what are the authors' thoughts on sub-classes not actually being semantic classes, but variations visible in the image?
3. The hierarchy is quite key, and performance with the GPT-3 hierarchy is much lower than with a given ImageNet-like hierarchy. a) The paper would be much richer if it included a thorough qualitative analysis (take any one dataset) of what the sub-labels are and how they impact performance. For example, is GPT-3 performance low because m=10 is too big and creates sub-labels which are wrong? Or does it miss important sub-classes that then lead to bad matching? The paper could include an IoU (or some set metric) of the GPT-3 sets with the manually created ontologies that may help understand why GPT-3 performance is low. It would also be nice to present some label sets (ground-truth and GPT-3) in the appendix for a few datasets. b) Shouldn't m be a variable number for each class, as would be expected in a human-created ontology?
4. Can the authors come up with a probabilistic justification for re-weighting by multiplying probabilities? Can it be thought of as some kind of joint probability modeling between super- and sub-classes? A theoretical foundation for the work would be great to include.
5. Questions about experiments: a) In Sec 4 and Table 1, what happened to the "standard" result at level 4? Any idea why it suddenly drops to 50% while all others are around 60-70%? b) What makes ObjectNet achieve a 32% improvement in performance? This is rather dramatic and might be good to try and explain (perhaps by presenting some of the labels as indicated in 3a).
6. [Minor] a) Most papers are cited using \citet instead of \citep. When using it in this format, please consider plural authors as the subject of the sentence. For example, Ilharco et al. (2022) then build on .. and put forth. Not "builds" on and "puts" forth. b) Page 4, typo, probabilitiy --> probability. c) Page 7, L357 (second column), something is mentioned to be in purple, although I'm not sure. d) Page 6, L324, word "Section" missing before 5.4.

Questions: For the rebuttal, authors may consider spending effort on answers to weaknesses points 3 and 4. Points 1, 2, and 5 may be addressed if there is space.
Point 6 can directly go into the updated text.

Limitations: I think the authors do a good job of clearly stating the method's limitations. I can't think of any clear negative societal impact, apart perhaps from hierarchies that may be culturally offensive.
Ethics Flag: No
Soundness: 3 good
Presentation: 4 excellent
Contribution: 3 good
Rating: 7: Accept: Technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one area, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

Rebuttal by Authors (Part 1)

We sincerely thank the reviewer for their incredibly detailed comments, and are glad to receive both the positive and the constructive feedback.

"The paper would be much richer if it included a thorough qualitative analysis (take any one dataset) of what the sub-labels are and how they impact performance. For example, is GPT-3 performance low because m=10 is too big and creates sub-labels which are wrong? Or does it miss important sub-classes that then lead to bad matching? …"

We thank the reviewer for bringing up this important issue, and we attempt to provide a succinct qualitative analysis here. We manually investigated the generated subclasses (m=10) for two datasets, ObjectNet (variable-length ground-truth subsets) and Entity30 (fixed-length ground-truth subsets). In general, we find that the Jaccard similarity between most GPT-generated subsets and their ground-truth versions is relatively low, maxing out at only around 0.3. While there are some cases in which the generated sets recover a portion of the ground-truth subsets (e.g., the generated subset for the ObjectNet class "garment" correctly includes the "t-shirt", "jeans", "skirt", and "dress" subclasses that are in the ground-truth subset), for the most part the GPT-generated sets fail to recover most of the true subclasses. Thus, it seems that the generally poorer performance of GPT-3-generated sets comes from an inability to guess all of the correct subclass labels, even when the ground-truth subset is smaller than the GPT-3-generated one. This investigation leads us to recommend to practitioners that, while GPT-3-generated subsets can work well as an initial first step, manually tuning the subsets with dataset-specific context may be beneficial. We have updated the final draft with this discussion and will additionally add examples of the label sets to the appendix, as suggested.

"[Minor] My reaction (expectation?) from the title was that the super-classes of labels (e.g., ImageNet) would be used somehow. However, this paper considers sub-classes part of the hierarchy. …"

We apologize for any miscommunication caused by our choice of title, and will consider possible modifications to the name of the paper and the relevant terms.

Rebuttal by Authors (Part 2)

"Shouldn't m be a variable number for each class, as would be expected in a human-created ontology?"
Thank you for catching this; it is indeed correct that in datasets with naturally existing hierarchies, different classes may have different numbers of subclasses (in our case, the true subclasses in Fruits360 and ObjectNet are both examples of this, as is the ablation in Section 5.4, part 3). In our experiments, the label set size is only fixed when using the GPT-generated label sets; we allow the label set size to be variable when using the true hierarchies, and we have modified the draft to make this explicitly clear. In future work, we hope to perform experiments where we allow variable subclass set lengths in the GPT setting as well. Our initial idea in this direction is to explore chain-of-thought prompting with large models, where we first query a GPT-like model for the length of the subclass set and then query again to obtain a subset of that specific length.

"Can the authors come up with a probabilistic justification for re-weighting by multiplying probabilities? Can it be thought of as some kind of joint probability modeling between super and sub-classes? A theoretical foundation for the work would be great to include."

Based on your comment and the suggestions of Reviewer mdQe, we attempted to improve our understanding of how CHiLS works by looking at how well calibrated the model is with respect to its output probabilities. We find that the reweighted outputs, once normalized, considerably improve the calibration of the underlying CLIP model over the baseline. For completeness, we copy over the table from the other response. Here, we report Expected Calibration Error numbers (multiplied by 100):

| Dataset | Superclass | CHiLS (True Map) | CHiLS (GPT Map) |
|---|---|---|---|
| CIFAR20 | 1.06 | 0.07 | 0.33 |
| living17 | 1.68 | 0.03 | 0.05 |
| RESISC45 | 1.40 | N/A | 0.27 |
| Food-101 | 1.84 | N/A | 0.05 |
| ObjectNet | 0.83 | 0.17 | 0.60 |

Rebuttal by Authors (Part 3)

"In Sec 4 and Table 1, what happened to the "standard" result at level 4? Any idea why it suddenly drops to 50% while all others are around 60-70%?"

After a preliminary investigation of this phenomenon, we think this behavior is due to the many scientific names used at Level 4. Note that in this experiment we used the BREEDS organization of the hierarchy (due to its ease of use and its enforcement that terms of similar semantic granularity sit at the same depth), and many of the class names are rather scientifically technical. For instance, the class lagomorph contains "rabbit" and "hare", and the class gallinacean contains subclasses such as "peacock" and "quail." Given this, we attribute the poor performance of the "standard" model to CLIP's poor embedding representations for rarer and more scientific taxonomic terms.

"What makes ObjectNet achieve a 32% improvement in performance? This is rather dramatic and might be good to try and explain (perhaps by presenting some of the labels as indicated in 3a)."

In Table 12 in the Appendix, we show the exact image classification task that we perform when referring to ObjectNet. We specifically take the subset of classes that overlap with the ImageNet hierarchy, and then use the BREEDS hierarchy to create a custom subset of ObjectNet at a reasonably coarse granularity (specifically, depth 3 of the ImageNet hierarchy). The large gap in performance can be explained as follows.
ObjectNet is a notably hard dataset by design, and it is made more difficult still by the ambiguity of the coarse class names (namely, the coarse names retrieved from the ImageNet hierarchy, such as "appliances", "equipment", and "accessory", may pose a challenge for the base CLIP model). Using the ground-truth subclass labels gets around the uninformative superclass names, and also controls for the fact that the subsets are not balanced, with some (like "accessory") having 12 subclasses but others (like "beverage") having only 1 subclass. We will update our appendix with this discussion.

Rebuttal by Authors (Part 4)

"Idea/suggestion for future work: I wonder if we can create sub-label prompts of fine-grained classes too by saying things like "corgi standing on all 4 legs visible from the side; or corgi facing the camera; ..."? … beneficial since CLIP is potentially more likely to match such a prompt rather than just the "an image of a corgi"?"

We appreciate the astute suggestion, and note that this idea is very similar to the work of Pratt et al. [2022], which uses GPT-3 to generate more expressive prompts for each class. In future work, we hope to investigate ways to combine CHiLS with this line of work in order to improve performance with both subclasses and expressive prompts. We also note that, early in our research, we tried other forms of sub-label prompts using synonyms rather than hyponyms, though these did not perform particularly well.
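To make the kind of prompt construction discussed in this last exchange concrete, below is a small, purely illustrative sketch of combining subclass names with richer, Pratt et al.-style descriptive templates. The template strings and class names are hypothetical examples, not configurations evaluated in the paper.

```python
# Hypothetical combination of subclass names with descriptive prompt variations.
SUPERCLASS = "dog"
SUBCLASSES = ["corgi", "husky", "labrador"]  # e.g. from a hierarchy or a GPT-3 query

TEMPLATES = [
    "a photo of a {sub}, a type of {sup}",
    "a photo of a {sub} facing the camera",
    "a photo of a {sub} seen from the side",
]

prompts = [t.format(sub=s, sup=SUPERCLASS) for s in SUBCLASSES for t in TEMPLATES]

# These prompts would then be embedded with the text encoder and averaged (or scored
# individually) per subclass, just as with a plain "a photo of a {sub}" prompt.
for p in prompts[:3]:
    print(p)
```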