AE comments:

Associate Editor Comments to the Author:

All reviewers confirm the contributions and organization of the paper. They also offer suggestions to improve the formulation, experimental details, analysis, and explanation. I recommend that the submission undergo a minor revision.

Reviewer Comments:

Reviewer: 1

Comments to the Author

This paper proposes an entropy-guided reinforced partial convolutional network that integrates local features with global features for zero-shot learning (ZSL). The local features are selected by reinforcement learning based on semantic relevance and region visual correlations. Partial convolution is proposed to extract local representations efficiently. The partial convolution and the entropy-guided reward both contribute to the training of the reinforcement module. The model is evaluated on four standard ZSL datasets and shows improvements over state-of-the-art methods.

Strong points:
1. The paper is generally well written and clearly structured, and presents all relevant details in a clean, formal way. I really like the figures in the paper, which clearly illustrate the model structure.
2. The paper addresses an interesting problem, and the proposed algorithm is well motivated. The ability to progressively explore localities is worthy of study, and the partial convolution and entropy guidance make the reinforcement-learning-based method feasible.
3. The experimental setup is diverse, covering zero-shot learning, generalized zero-shot learning, ablations over the different modules and model efficiency, several visualization analyses, and a comprehensive hyper-parameter analysis. The selected baselines seem good to me, and good performance is achieved against them.
4. The literature review is adequate. The main categories of zero-shot learning methods are all included, and related works on representation learning are also introduced.

Overall, the paper constitutes an important contribution, with some minor issues that need to be addressed in a revision.

Weak points:
1. Is the meaning of j in L_j (Eq. (4)) and e_{i,j,k} (Eq. (8)) the same? If not, the authors should rename the reused symbol.
2. Does the symbol <,> in Eq. (2), Eq. (4), and Appendix B denote the same mathematical operation in each case? The authors should explain its meaning to avoid confusion.
3. In Experiment section F, as far as I know, e_gn and the output of f_cj have different dimensions: e_gn is a visual embedding, while the output of f_cj is an attribute vector. How can e_gn and the output of f_cj be added?
4. How are the datasets split into seen and unseen classes? Does the split influence the experiments?
5. Why can the locality diversity loss L_m in Eq. (5) guarantee that the local subnet captures diverse localities? The authors should explain in more detail why L_m works (one common form such a penalty can take is sketched after this list).
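For reference, diversity losses of this kind typically penalize pairwise similarity between the local feature vectors, so that minimizing the loss pushes the localities apart in feature space. The following is only a minimal sketch of that generic mechanism, assuming L_m resembles a cosine-similarity decorrelation penalty; the paper's actual definition in Eq. (5) may differ, and the authors should spell out the intuition for their exact form.

```python
import torch
import torch.nn.functional as F

def diversity_loss(local_feats: torch.Tensor) -> torch.Tensor:
    """Generic locality-diversity penalty (illustrative only, not the
    paper's Eq. (5)).

    local_feats: (K, D) tensor holding K locality embeddings, K >= 2.
    Penalizes the mean off-diagonal cosine similarity, so the optimizer
    is rewarded for making the localities mutually dissimilar.
    """
    z = F.normalize(local_feats, dim=-1)           # unit-norm rows
    sim = z @ z.t()                                # (K, K) cosine similarities
    k = sim.size(0)
    off_diag = sim - torch.eye(k, device=sim.device)  # zero out self-similarity
    return off_diag.sum() / (k * (k - 1))          # mean pairwise similarity
```

If L_m follows this pattern, minimizing it forces different locality channels to attend to different evidence, which is the usual argument for why such losses yield diverse localities.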
Reviewer: 2

Comments to the Author

The submitted work focuses on the challenging topic of zero-shot learning, aiming to design deep neural network architectures that automatically recognize categories that did not appear during the training procedure. The topic is surely relevant and interesting to the IEEE T-CSVT industrial and academic communities. The main paradigm of using entropy analysis and reinforcement learning to make models better fit different tasks looks interesting and reasonable. Overall, the manuscript is reasonably well written and organized, and the proposed idea is technically convincing and described with a sufficient amount of detail. Compared to existing zero-shot learning algorithms, the proposed algorithm performs relatively well. While I think the paper can potentially be accepted, the authors are encouraged to address the concerns below in the revision.

1. In the introduction, it would be great to state the key dimensions of the paper: a) why the research problem explored by the study is important, and b) what good impacts/applications the proposed technique could have in the real world.
2. Unfortunately, the related works section, although not too brief, lacks completeness, since it does not include a few key works published in the last three years which, performance-wise, represent the real current state of the art for zero-shot learning. Preferably, I would like to see a comparison against these algorithms:
- Knowledge Distillation Classifier Generation Network for Zero-Shot Learning
- HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning
- Zero-shot learning with transferred samples
- Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning
- Episode-based prototype generating network for zero-shot learning
- FREE: Feature Refinement for Generalized Zero-Shot Learning
- Label-activating framework for zero-shot learning
3. In my experience, RL is not easy to train. Did you encounter any problems when training the RL module in the zero-shot setting? More insights would be helpful.
4. In Section 3.4, it seems that the authors use different step ranges for different datasets. Why is that?
5. In Fig. 4, it seems that the authors compare against different sets of approaches, which is unusual. Could the authors explain why?

Reviewer: 3

Comments to the Author

The authors propose an entropy-guided reinforced partial convolutional network for zero-shot learning, which is novel and impressive. The authors propose to use reinforcement learning to enhance the convolution process in the neural network, which can boost model performance. The experimental results are robust and solid. This method is promising and can be used in a wide range of machine learning applications. However, some of the analysis is missing.

Strengths:
1. This paper contributes a novel idea, i.e., ERPCNet, which extracts and aggregates localities progressively. The idea is interesting and can be widely applied in other neural networks.
2. The experiments are sufficient. The model performs better than many state-of-the-art methods, and the ablation study is solid. The visualizations of the learned feature embedding and of the reinforced decision process are impressive. All of them demonstrate that the proposed model can progressively pick the best locality to help distinguish similar or diverse objects effectively.
3. Overall, the writing of this paper is clear and easy to understand.

Weaknesses:
1. The quantitative analysis of entropy during the training process is missing.
2. There are various issues with the mathematical statements and notation:
a) <,> is used to define arrays throughout the paper and is used differently in the supplementary material.
b) Is \pi defined both as a sampler and as the policy network?
3. There are typographical errors in the paper, e.g., "Figures 4" and "Figures6".
4. The proposed method involves training three sub-networks and optimizing many losses. Have the authors compared the training time with that of other methods?
5. The proposed method and analyses should be explained more clearly (a sketch of my reading of the two points below follows this list):
a) How do you calculate the joint prediction at each step? It is defined as the multiplication of f_cj and the e vector, but I am not sure whether this is a typo.
b) How do you estimate the density values in the entropy calculation in Eq. (8)?
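For clarity, the sketch below shows my reading of those two steps; it is only my assumption about the intended computation, not the authors' verified implementation. It assumes the joint prediction at each step is the inner product of the predicted attribute vector (the output of f_cj) with every class's semantic vector, and that the densities in Eq. (8) are the softmax-normalized class scores.

```python
import torch
import torch.nn.functional as F

def joint_prediction_and_entropy(attr_pred: torch.Tensor,
                                 class_embed: torch.Tensor):
    """Illustrative reading of the per-step prediction and Eq. (8).

    attr_pred:   (B, A) attribute vector predicted by f_cj at this step.
    class_embed: (C, A) semantic/attribute vector of each class.
    """
    scores = attr_pred @ class_embed.t()           # (B, C) compatibility scores
    p = F.softmax(scores, dim=-1)                  # scores as a class distribution
    entropy = -(p * torch.log(p + 1e-12)).sum(-1)  # (B,) Shannon entropy
    return scores, entropy
```

If this reading is correct, the "multiplication" in 5a is a matrix product rather than an element-wise one, and the density estimation in 5b reduces to softmax normalization; the authors should confirm or correct this in the revision.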