Reviewer #1 Questions

1. PAPER SUMMARY

What is the paper about? Please be concise (2 to 3 sentences).

This paper extends prior work on the Adversarial Reprogramming paradigm to adapt models trained on vision tasks to perform language tasks without modifying the weights of the vision model. The experiments cover sentiment, topic, and DNA sequence classification tasks, and the work shows that modestly competitive performance can be achieved compared to models native to the modality.

2. PAPER STRENGTHS

Please discuss, justifying your comments with the appropriate level of detail, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc).

* I do think the results are a useful proof point for the broader AI community of what can be done with adversarial reprogramming.
* I'm glad that this work explored the setting where the target task is greater in cardinality than the victim task, as this seems to be a main criticism of some seminal works.
* It's good that the authors explore the impacts on a variety of text classification tasks and datasets.

3. PAPER WEAKNESSES

Please discuss, justifying your comments with the appropriate level of detail, the weaknesses of the paper (i.e. lack of novelty (give references to prior work), technical errors, and/or insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error.

* Lack of inductive hypotheses: The paper provides interesting empirical results in a focused setting, for which I wish the authors had spent some time hypothesizing reasons. If vision models can be used for text tasks just by constructing inputs in the right ways, what does this tell us about the underlying vision models? Why might these vision models latently encode information useful for text tasks?
Answering some of these questions can absolutely be considered out of scope for this work, but it's useful to articulate them, even just as directions from which the rest of the community can benefit. That said, I do think some questions should have been expounded on more. For example, the authors motivate the whole work with transformers, given the architectural alignment they now provide between text and vision models, but beyond presenting results there's little _explanation_ work done. I don't believe it's sufficient to present results; it's also the job of a paper to try to explain them, at a minimum by offering a hypothesis. Likewise with the pre-trained versus randomly initialized experiments: is there some "deep image prior" (arXiv:1711.10925) effect that we're seeing from the randomly initialized models? How do we explain why they are actually quite competitive with the pre-trained ones?

* Mischaracterized complexity: If I understand the inference latency costs in Supplemental S1, I believe the latency is mischaracterized by an apples-to-oranges comparison: the comparison should include not only the Adversarial Program in Algorithm 1, but also the cost of running the victim image model as well as the label remapping. This gets at what I believe to be another issue, namely that vision models are often many times larger than language models in model capacity (parameters) and computational complexity. This difference is likely to impact attack likelihood and patterns, yet it is not well studied, or at least discussed, in the work.

* Relevance: I feel mixed about the relevance of this work to this venue. On one hand, this work performs experiments on text classification, which this community may be less well equipped to assess for correctness, fair comparison, etc.
On the other hand, this study casts light on what is possible with vision models by showing how they might be used to solve non-vision tasks as well (although, arguably, this empirical finding is likely not unique to vision models).

* Scope creep: I think the title and some of the write-up over-reach a bit in framing the work as "cross-modal", when it provides a proof point only for the smaller area of vision-to-text classification. I would be satisfied on this point, though, if the authors committed to cleaning up the write-up and perhaps adding a subtitle, or appending to their title a qualification, making clear that this is just a starting point with vision-to-text classification.

4. RECOMMENDATION

Borderline

5. JUSTIFICATION

Justify your recommendation based on the strengths and weaknesses. Please be considerate to the authors and provide constructive feedback.

Borderline, leaning conditional acceptance on improving the write-up. The work presents an interesting empirical study of what's possible with vision classifiers in the text classification setting using adversarial reprogramming techniques. The experimental sections feel like a dump of results that simply argues the approach is competitive or possible, rather than rigorously interrogating why. There is some curious behavior with respect to the randomly initialized vs. pre-trained models, and the latency results hint that performance is much more modest in competitiveness than claimed. Still, the experiments are a first of their kind that I think would be interesting for the community to dig into further, and this work provides a starting place.

Reviewer #2 Questions

1. PAPER SUMMARY

What is the paper about? Please be concise (2 to 3 sentences).

The paper extends the adversarial reprogramming technique to the cross-domain setting. Concretely, it proposes a technique for repurposing pre-trained image classification neural networks for sequence classification tasks.

2. PAPER STRENGTHS

Please discuss, justifying your comments with the appropriate level of detail, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc).

* The paper is well written and easy to follow. The topic addressed is very relevant as more and more pre-trained models become available.
* The work also reveals the threat posed by adversarial reprogramming towards misusing pre-trained models. Security experts will now have to think of all the possible domains in which a pre-trained model under consideration can be misused.
* The paper also considers the setting where concealing the adversarial perturbation is desired. Additionally, it considers the case where the original task has fewer labels than the target task.
* The evaluation of repurposing image classifiers for sequence classification is quite thorough. Adversarial reprogramming is evaluated on top of four neural architectures covering CNN and vision transformer designs. Diverse sequence classification tasks in NLP and DNA sequencing are evaluated as target tasks.

3. PAPER WEAKNESSES

Please discuss, justifying your comments with the appropriate level of detail, the weaknesses of the paper (i.e. lack of novelty (give references to prior work), technical errors, and/or insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error.

* I would have liked at least one more victim model domain (apart from image) to support the claim that adversarial reprogramming works in a cross-domain setting.
* I find the paper medium on novelty. Adversarial reprogramming is already an established approach. To extend it to a cross-domain setting (repurposing image classification models for the task of sequence classification), this work embeds each token and uses the embeddings as patches to assemble an image.
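As I read it, the input mapping described above could be sketched roughly as follows. This is my own minimal, single-channel illustration, not the authors' implementation: the function and variable names are hypothetical, the embedding table would be trainable in the real method, and the real inputs presumably have three color channels.

```python
def tokens_to_image(tokens, embedding, patch, img_size):
    """Map a token sequence onto a square image by tiling each token's
    embedding vector as one (patch x patch) pixel patch.
    Sketch only; assumes img_size is divisible by patch and h == w."""
    assert img_size % patch == 0
    n = img_size // patch                      # patches per side
    img = [[0.0] * img_size for _ in range(img_size)]
    for idx, tok in enumerate(tokens[: n * n]):
        i, j = divmod(idx, n)                  # patch row/column (h == w case)
        vec = embedding[tok]                   # length patch*patch, learned
        for r in range(patch):
            for c in range(patch):
                img[i * patch + r][j * patch + c] = vec[r * patch + c]
    return img

# toy usage: vocabulary of 3 tokens, 2x2 patches, 4x4 image
emb = {0: [0.1] * 4, 1: [0.5] * 4, 2: [0.9] * 4}
image = tokens_to_image([1, 0, 2], emb, patch=2, img_size=4)
print(len(image), len(image[0]))  # 4 4
```

Framed this way, the novelty question above is essentially whether learning this embedding-to-patch mapping, rather than the pixel-space perturbation of the original adversarial reprogramming work, is a sufficient contribution.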
* To evaluate the setting where the original task has fewer labels than the target task, this work constrains the adversary's access to a subset of labels. However, the pre-trained model has already learnt to classify all the labels. To truly evaluate this setting, it would be desirable to use a pre-trained model that was trained on fewer labels to start with.
* The results of adversarial reprogramming are shown vis-à-vis the results of Bi-LSTM, 1D-CNN, and TF-IDF baselines. It would be nice to see the results of adversarial reprogramming in relation to state-of-the-art sequence classification models (e.g. transformer-based models), although it is not necessary to match those results for the technique to be valuable, as adversarial reprogramming is computationally much cheaper. I would also have liked some study of the effect of increasing the number of parameters in the embedding layer, to see what it takes to reach state-of-the-art results.

4. RECOMMENDATION

Weak Accept

5. JUSTIFICATION

Justify your recommendation based on the strengths and weaknesses. Please be considerate to the authors and provide constructive feedback.

As mentioned before, I find the paper medium on novelty. The evaluation is thorough on repurposing image classifiers for sequence classification. I would have liked at least one more domain (e.g. audio) to support the claim that adversarial reprogramming works in a cross-domain setting. The topic addressed, however, is very relevant. The work reveals a broader scope of the security threat of misusing pre-trained models in a different domain.

Reviewer #3 Questions

1. PAPER SUMMARY

What is the paper about? Please be concise (2 to 3 sentences).

The paper proposes a method to perform reprogramming across different domains. Particularly, starting from models trained for an image classification task, the authors propose two components that let the image models serve other sequence classification tasks.
First, a method to transform the sequence inputs (language sentences, DNA sequences) into an image format is proposed. Second, the mapping from the image labels to the sequence class labels is discussed. The authors perform experiments on several NLP and DNA sequence tasks, with source image models based on CNN and Vision Transformer architectures.

2. PAPER STRENGTHS

Please discuss, justifying your comments with the appropriate level of detail, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc).

- The studied problem is interesting and relevant. A good reprogramming method could help transfer the power of existing models to new tasks that they were not originally trained for.
- The authors claim to be the first to perform experiments that reprogram across data domains, i.e. using image classifiers to serve other sequence classification tasks in NLP and DNA classification.
- The experiments show competitive performance of the reprogrammed models compared to ones trained from scratch.

3. PAPER WEAKNESSES

Please discuss, justifying your comments with the appropriate level of detail, the weaknesses of the paper (i.e. lack of novelty (give references to prior work), technical errors, and/or insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error.

- Algorithm 1: it looks like the calculation of i and j only works for h = w. If so, this should be mentioned in the algorithm.
- Some details are missing that would help the reader better understand the method:
  + Is the patch size p used by the reprogramming method equal to the patch size of the vision transformer? What effects might there be if these patch sizes do not match? L636-639: Do we change the patch size of both the source and target tasks?
  + Is all the reprogramming training (input and label mappings) done after the source model has finished training, or concurrently with the source model? Does the training modify the source model weights?
- The additional number of parameters in the reprogramming components is not discussed. If the additional parameters are too many, the benefit of reprogramming vanishes, since a model with that much added capacity could be sufficient to train well on its own, without being reprogrammed from an existing model. Also, how does the model size of the reprogramming components compare to that of the models trained from scratch in Table 2?
- Table 2: are the baseline models trained from scratch by the authors? Why not use the best public benchmarks on these datasets? It is hard to calibrate and compare when the models were trained by the authors, and the best public benchmarks may have higher accuracies.
- Table 3 (supplementary material) suggests that a linear model on top of the reprogramming components alone is good enough (except on the DNA datasets), which downplays the need for reprogramming from an existing model.

4. RECOMMENDATION

Weak Accept

5. JUSTIFICATION

Justify your recommendation based on the strengths and weaknesses. Please be considerate to the authors and provide constructive feedback.

While there are concerns regarding the paper, as mentioned in the weaknesses section, the paper investigates a relevant and interesting technique that extends trained image classification models to NLP and DNA tasks. If the method is indeed helpful, the work could help broaden the impact and application of existing neural models.