Overall, all three reviewers agree that the paper is very interesting and worth publishing in IJCV. The reviewers have also provided very detailed suggestions, and addressing them should improve the submission significantly. In particular, there is a shared interest in assessing the relative contribution of the different components of the system.

Reviewer #1:

In large-scale image annotation, one cannot assume that all training/test images have been tagged with all the objects they contain. This is reflected by the accuracy measures employed to quantify the loss on such datasets, e.g. the top-k loss in ILSVRC. This paper makes two contributions:

1) While most state-of-the-art systems simply train one-vs-rest classifiers, it is proposed to directly optimize the loss used in the evaluation (or, more accurately, a continuous surrogate of this loss). This can be achieved using the structured prediction framework. Rather than training all classifiers from scratch, it is proposed to train a single parameter vector theta that re-weights the features of every class.

2) To 'strengthen' the weakly labeled ground truth, it is proposed to introduce a set of latent variables. They are designed to represent those objects that appear in an image but were not annotated. The non-convex learning proceeds by alternately optimizing the latent variables and the parameter vector theta.

Experimental results on three public benchmarks show that improvements can be obtained, especially when images are annotated with a large number of correlated labels.
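To make the two contributions above concrete, here is a minimal sketch of the multiplicative re-weighting and a plausible latent completion step, written in NumPy with hypothetical names and shapes; it illustrates the reviewer's description, not the authors' exact formulation:

    import numpy as np

    def scores(theta, W, x):
        # W: (C, D) pre-trained one-vs-rest weight vectors, x: (D,) image features.
        # Class score = <theta * w_c, x>, i.e. the binary classifier response with
        # every feature dimension re-weighted by the single shared vector theta.
        return (W * theta) @ x                      # shape (C,)

    def predict_top_k(theta, W, x, k=5):
        # Predict the k highest-scoring labels, mirroring a top-k evaluation measure.
        return np.argsort(-scores(theta, W, x))[:k]

    def complete_labels(theta, W, x, observed, k=5):
        # One plausible latent completion rule: keep the observed (weak) labels and
        # fill the remaining slots with the highest-scoring unobserved labels.
        ranked = [c for c in np.argsort(-scores(theta, W, x)) if c not in observed]
        return list(observed) + ranked[:max(0, k - len(observed))]

Learning would then alternate between (i) completing the weak labels with such a rule and (ii) updating theta by a gradient step on the structured surrogate loss.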
I have mixed feelings about this paper.

On the positive side:
1) It is superbly written: the English is perfect, the overall structure is crystal clear, the presentation is extremely didactic, and the literature review is a model of its genre.
2) The proposed ideas are novel (to the best of my knowledge).
3) The experiments significantly extend those reported in the conference paper.

On the negative side:
1) The authors argue that they cannot learn the parameters of all classifiers because "this would mean simultaneously optimizing several million parameters, which is not practical using existing structured learning techniques". Hence, they propose to optimize a single parameter vector theta. What the authors therefore seem to imply is that the training cost is dominated by the gradient steps, not by the arg max of equation (11). Is this really the case in practice? I would need to see a study of the relative costs to be convinced.
2) Learning a single parameter vector is likely to be suboptimal with respect to learning all classifiers, at least if enough training images are available. Indeed, the dimensions that are important are expected to vary from class to class. Learning the 1,000 classifiers on the full 1.2M training images of the ILSVRC dataset might be difficult with current structured learning algorithms. However, this should be perfectly feasible on the ImageCLEF and MIR datasets, which contain a much smaller number of classes and training samples. The paper would be much stronger if it contained a comparison to the "learning all classifiers" alternative on these small datasets. We would then get a feeling for "how much is lost" by learning a single theta.
3) The motivation for using latent variables is that "the joint parametrization of (eq. 7) is problematic, since the energy of the ground-truth labeling Y^n (...) is not readily comparable with the energy of the predicted output Y." While this might be true, it makes the problem more complex and, in particular, non-convex. Moreover, nothing prevents running the algorithm without the latent variables. Therefore, the paper would be much stronger if it reported experiments without the latent-variable component, to understand how much is lost when latent variables are not used. This should be easy to run with the current implementation (please correct me if I am wrong).

To summarize, while the described ideas are interesting, it is difficult for me to understand: 1) which of the two contributions -- structured learning and latent variables -- has the larger impact, and in particular whether both of them are necessary to obtain the reported improvements; 2) whether a simpler alternative, such as directly learning all classifiers, is as infeasible as claimed in the paper.

Miscellaneous other comments and questions:
- End of page 2, start of page 3: it is said that Local Coordinate Coding and Super-vector Coding are used to reduce the size of the image descriptor vectors. I believe this is a misunderstanding, as LCC and SV features are typically very high-dimensional (as used by their authors).
- Page 6, left column: "we would like to leverage the already good classification performance of existing binary classifiers, simply by reweighting them ...". I found this sentence ambiguous. What you reweight are not the classifiers but the dimensions of the classifiers.
- Page 6, left column, line 38.5: "from" -> "form".
- I suspect there is a minor error in the gradient of equation (14): it should be 2 lambda theta (or the regularization should be lambda/2 ||theta||^2); see the derivative spelled out after this list.
- Start of section 4.2: ILSVRC consists of 1.2M training samples + 50K validation + 150K test (and not 1M training samples + 250K test).
- Is there a reason why all the weights of Fig 2(a) are non-negative?
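To spell out the remark on equation (14), the derivative of the squared-norm regularizer is (written here in LaTeX; the notation follows the review, not necessarily the paper):

    \frac{\partial}{\partial \theta}\,\bigl(\lambda \|\theta\|^2\bigr) = 2\lambda\theta,
    \qquad
    \frac{\partial}{\partial \theta}\,\Bigl(\tfrac{\lambda}{2} \|\theta\|^2\Bigr) = \lambda\theta.

So either the gradient in equation (14) should read 2 lambda theta, or the regularizer should be written as lambda/2 ||theta||^2 so that its gradient is lambda theta.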
Reviewer #2:

The work proposes to use an SOSVM to train a "meta-classifier" that multiplicatively adjusts class-level classifier weights in a hierarchy. Extensions are shown for the multi-label setting where a hierarchy is given, and for a latent-variable case in which an asymmetric weak labeling is accurate for the given labels, but missing labels may or may not be present. Empirical evaluation is performed on a variety of datasets, showing improvement with the proposed modifications. Overall, the work is interesting, and the experimental validation is good. The main weaknesses are:

1) The writing style is too imprecise and informal. There is too much hand-waving and too much reference to the conference paper. It can be assumed that the majority of potential readers will not have read it, and will approach the paper from a different perspective from that assumed by the narrative of the current draft. More precise language and the avoidance of unsupported statements will increase the appeal to a wider audience.
2) The literature review is quite focused and does not seem to consider the bigger picture of how the work fits into the field.
3) The paper needs a greater focus on being self-contained. For example, the authors at one point refer to the URL of the main author's webpage for an implementation, rather than making the description of the algorithm self-contained within the journal paper.

As a result, I believe the paper needs a major revision, focusing largely on the presentation of the material.

Other important areas of improvement/clarification:
* Justify the choice of a multiplicative re-weighting, either theoretically or through some kind of empirical support. Compare to the hierarchical prediction method of Cai & Hofmann, CIKM 2004 (reference missing), at the very least on some smaller dataset. If it is impractical to extend this to very large datasets, show at what size it breaks down in practice, and show what you lose (if anything) from using your lower-capacity model.
* Can the model be related to a different form of regularization? Please make this explicit.
* Please introduce notation as it is used. Although the table of notation is nice, it is difficult, especially when reading the paper on a computer screen, to keep jumping back to the table to remind oneself of the meaning of a specific variable, especially as the notation does not exactly conform to that of other papers (this would not be a problem if it were possible to read through without memorizing the notation table).
* Please explain whether and how the choice of a multiplicative re-weighting leads to a restriction of the function spaces that can be represented. If it does not, please derive, e.g., a kernelized version of the approach. If it does, please make this explicit. The authors seem to consider only linear models in the paper.
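To make the function-space question concrete, assume per-class scores of the form sum_d theta_d w_{k,d} x_d, which is one reading of the multiplicative adjustment described above. Then

    \sum_d \theta_d\, w_{k,d}\, x_d \;=\; \langle w_k,\ \theta \odot x \rangle ,

i.e. the shared theta acts as a single diagonal rescaling of the feature space applied to all classes. With the w_k held fixed, this hypothesis space is a strict subset of re-learning every w_k independently; an explicit statement of this restriction, and of whether it matters empirically, would address the point above.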
Missing citations: Hierarchies have been considered in a number of contexts in the computer vision literature, but this is not reflected in the literature review. An additional area is the use of discriminative latent-variable models with weak supervision. The following references are missing, for example:
A. Binder, K. R. Müller, M. Kawanabe, "On Taxonomies for Multi-class Image Categorization", International Journal of Computer Vision, 2011.
E. Bart, I. Porteous, P. Perona, M. Welling, "Unsupervised Learning of Visual Taxonomies", Proc. CVPR, 2008.
J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, A. A. Efros, "Unsupervised Discovery of Visual Object Class Hierarchies", Proc. CVPR, 2008.
M. B. Blaschko, A. Vedaldi, A. Zisserman, "Simultaneous Object Detection and Ranking with Weak Supervision", Advances in Neural Information Processing Systems, 2010.
A search for "weak supervision computer vision" will also turn up further relevant work.

Specific notes:
P2 col 1 line 140 - at this point, it has not been described that the setting will be to pick a fixed number of labels. It seemed that the best thing to do would be to pick all labels. The setting should be introduced earlier so that it is clear that this is not a degenerate problem. "This type of performance measure" is an example of the imprecise language that plagues the paper.
Eq (6) + loss functions - it should be emphasized more that these are the main practical differences making this work novel (Eq 6 and 3, plus the equation following 3); please number all equations.
Eq (12) - I had to look up Omega in the table at this point; please introduce variables as you use them in the main text rather than relying solely on the table.
P7 col 2 line 44 - state that the greedy algorithm leads to the optimum of the subproblem.
Line 56 - "excellent" sounds like "good enough already", which turns off the reader; this same language appears at least once later in the paper.
P8 col 1 - explain c better (around line 25). Explanations here and throughout are very informal.
P8 col 2 line 26 - It is not appropriate to refer to code when further explanations are needed, especially code that does not have a dedicated URL on an archival site (e.g., MLOSS or SourceForge). Please make the paper self-contained. I very much appreciate that the authors make code available and applaud them for the effort; however, the paper still needs to contain all important details.
P9 col 1 line 30 - Direct regularization of the squared norm of the parameters implies a prior that depends on the problem parametrization used by the base classifiers. This should be made explicit, and it should be discussed what restrictions this introduces. Is it possible to use another regularizer that lives in a more general Hilbert space, even if a specific choice of this space leads to squared-norm regularization of the ambient parameters? Is there a meaningful extension to kernelized classifiers? Otherwise, I have trouble believing the claim on P8 col 2 line 41 that the method is readily extended to other state-of-the-art classifiers. Vedaldi et al., 2009 (admittedly in the context of detection), showed that non-linear kernels can be essential in achieving state-of-the-art performance. Can different feature spaces be used for different classes? Color may be important for some classes, while texture may be more important for others, as a simple example. Previous experiments on multiple-kernel learning have shown that a concatenation of the feature spaces may or may not be appropriate (e.g., Gehler and Nowozin). Please state the restrictions or flexibility of the model.
P8 col 2 line 58 - what does "rational" mean in this context? This is informal and is not a convincing argument. Please use scientific prose with logical entailment.
P10 col 1 line 55 - "possibly the least appropriate" - this statement is informal and unsupported at this point in the paper. Please use precise language that is supported by clear, objective explanation. "Possibly" is a "weasel word": http://en.wikipedia.org/wiki/Weasel_word
P10 col 2 line 44 - "previous experiment" - at first I was not sure whether you were referring to the conference paper or to the previous section. Please use the section number here.
P12 col 2 line 39 - is it because of the loss, or just because of correlations present in the training data?
Fig 6 & 9 - label the y-axis and describe more clearly how this is measured.
The results and discussion sections are mixed, which is not good scientific writing style. Please better approximate the classic IMRAD scientific paper outline: http://en.wikipedia.org/wiki/IMRAD
P14 col 2 line 42 - "Robust" is not used here in its standard sense in vision/learning/optimization. Please use more appropriate vocabulary here and throughout the paper, and please define the term precisely when introducing it.

The empirical validation is, overall, interesting. The model proposed in the paper is on the low side for novelty, but just clears the bar. Improved writing would aid the paper substantially.
Reviewer #3:

This paper aims to optimize a robust loss function for multi-label image annotation. This loss function differs from the conventional 0-1 loss in two ways: (1) the ground-truth labels may be incomplete, and the loss function does not penalize predicted labels that are not in the ground truth; and (2) the loss also takes into consideration the hierarchical structure between categories. This is an interesting problem with practical value. This robust performance measure naturally occurs in image annotation problems because it is very hard for the ground-truth labels to be exhaustive. The technique proposed in this paper is novel. The loss is optimized by solving a structured SVM with latent variables representing the hidden ground-truth labels. The solution is elegant and general enough to handle a variety of situations. The paper is well written and easy to follow. The references are adequate. Experimental results are reported on multiple datasets, and the effectiveness of the proposed technique is convincingly demonstrated.

The current draft is acceptable for publication with some minor revisions:
(1) (Sec. 5) A more plausible reason why the smallest hierarchy shows the best improvement could be that, when optimizing the 0-1 loss for a large hierarchy, the confusion is already hierarchical and aligns well with the semantic hierarchy, i.e., semantically close classes are confused more often, as shown in Deng 2010. This naturally optimizes the hierarchical loss, since the hierarchical loss encourages confusion between semantically similar categories. For small hierarchies, however, the confusion pattern is less hierarchical, i.e., more like that of a flat, distinct set of classes; thus optimizing the hierarchical loss changes the predictions significantly.
(2) Page 9, L.39: it appears that the "ImageNet dataset" with 1.25M images is the data used in the ImageNet Challenge. If so, a more accurate description of "the ImageNet dataset" is "the ImageNet Challenge dataset" or "the ILSVRC 2010 dataset", as the actual ImageNet dataset is much bigger.
(3) It would be informative to see real results on example images before and after learning.

Moreover, I have the following suggestions for the authors to consider; I believe these could further strengthen the paper.
(1) Since you already have the probabilities from the binary classifiers, a straightforward way to account for the hierarchical loss is to predict the $K$ class labels with the smallest expected loss. It would be interesting to see how that compares against the proposed method (a minimal sketch of this baseline is given after this list).
(2) Right now the learning technique improves over the binary classifiers by considering two factors: the hierarchical loss and the hidden ground truth. It would be nice to tease them apart and show their individual contributions. One way is to set the hierarchical loss to the 0-1 loss.
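To make suggestion (1) concrete, here is a minimal sketch of such an expected-loss baseline, written in NumPy with hypothetical names; it assumes calibrated per-class probabilities and a precomputed pairwise hierarchical-loss matrix, neither of which is specified here:

    import numpy as np

    def predict_min_expected_loss(p, L, K=5):
        # p: (C,) per-class probabilities from the existing binary classifiers
        #    (e.g. sigmoid outputs normalized to sum to one).
        # L: (C, C) loss matrix, L[t, j] = hierarchical loss of predicting label j
        #    when the true label is t (with L[t, t] = 0).
        expected = p @ L                     # expected[j] = sum_t p[t] * L[t, j]
        return np.argsort(expected)[:K]      # the K labels with smallest expected loss

Such a baseline requires no additional training, so it would help separate how much of the improvement comes from using the hierarchical loss at prediction time rather than from the structured learning itself.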