First revision:

Reviewers' comments:

Reviewer 2:

The authors present findings from a study evaluating the accuracy of an artificial intelligence (AI) model, HAMIL-Net, for reviewing imaging of foot and ankle injuries. The authors conclude that their algorithm could potentially be used for "instant diagnosis" and thereby reduce costs and free specialized clinical personnel in overwhelming mass casualty events to deal with the most serious cases. These types of algorithms have potential utility and deserve to be studied, but it is clear from the authors' results that HAMIL-Net is not currently ready for clinical use.

The main issue with this manuscript is that the authors seem to have tailored their analysis and presentation of results to show HAMIL-Net in the most favorable light while ignoring important nuances in the data and their implications for diagnostic accuracy. Perhaps the greatest indication of this favorability is when they treat findings of "abnormal" as a "fracture" and "fracture" as "abnormal" in their "3-way" "integrated detector" (see Table 2). This inflates diagnostic accuracy and AUC. If this is HAMIL-Net at its best, I think it is fair to question whether this is an appropriate tool for making diagnoses and guiding treatment. This reduces radiology to the status of a screening tool, which is arguably too limiting considering the time and financial costs required to obtain images. Furthermore, the overall accuracy of this "integrated detector" is only 77%, which raises the question of whether an overall error rate of 23% is clinically acceptable.

Furthermore, AUC should not be the only measure used or heavily emphasized in diagnostic accuracy studies because it is an overall measure that does not reflect all aspects of the data that are used to calculate it. The only circumstance in which AUC could be considered an excellent measure of diagnostic accuracy is when it is >0.9, and preferably >=0.95. Only at those levels are sensitivity and specificity likely to both be high and similar in magnitude. Trade-offs between sensitivity and specificity at various cut points need to be carefully examined and understood, along with other measures such as false positive and false negative rates and positive and negative predictive values, before any conclusions about the accuracy and clinical utility of a test, device, or algorithm can be made.

A detailed examination of the data presented in Table 2 illustrates the point I made above. It reveals several nuances that would need to be addressed before HAMIL-Net could ever be considered for clinical use. For example, the specificity of the 3-way "integrated" model is only 65.27%. That means that over one-third (34.73%) of those with normal imaging results were incorrectly identified as having abnormal results. Is that an acceptable error rate? The detailed examination further revealed that the sensitivity of 83.6% is based on both fractures and other abnormal results being identified as "abnormal." In other words, that sensitivity is based on fractures and other abnormal findings NOT being classified as normal by HAMIL-Net rather than a fracture being correctly classified as a fracture. However, when the accuracy for fractures and other abnormal results is defined more rigorously, such as fractures being correctly identified as fractures, the sensitivity is well below 83.6%.
For example, there were a total of 3,300 fractures used in the 3-way "integrated detector" in Table 2 (480 + 2,017 + 803 = 3,300). Of those, HAMIL-Net correctly identified 2,017 as fractures, with 480 (14.5%) being incorrectly identified as normal and 803 (24.3%) being incorrectly identified as abnormal. That is an error rate of 38.9%, which corresponds to a sensitivity for fractures of only 61.1%! Likewise, the error rate for other abnormal diagnoses is 40.4% (2,119 other abnormal cases were classified as normal and 2,949 were classified as fractures), which corresponds to a sensitivity of 59.6%! The 83.6% sensitivity is based on the imaging finding something that is not normal. That is the most minimal possible definition of a clinical problem. It would be tantamount to telling a patient that "imaging revealed that something is wrong." How could appropriate treatments and expected time courses for recovery be determined from such a vague assessment of the injury? How is this an improvement over the current system that relies on radiologists to interpret images?

The results and conclusions as presented in this manuscript are misleading and do not reflect a full examination of the data available to the authors. This manuscript at best represents a starting point for future research about and discussion of HAMIL-Net. The authors need to discuss the details of their results fully and more accurately instead of emphasizing only the best possible results obtained under the least rigorous, unrealistic, and least clinically meaningful definitions of imaging results. It is clear that HAMIL-Net needs much more research before it could be considered fit for clinical use, if it ever achieves that. Clinical context matters. What specific types of fractures and other abnormalities are driving accuracy and inaccuracy? HAMIL-Net's ability to accurately identify SPECIFIC types of fractures and other abnormalities needs to be determined. This is what the authors need to discuss, and there is no guarantee that HAMIL-Net will be able to do this.
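For concreteness, the per-class sensitivity calculation above can be written out as a short sketch. Only the fracture-row counts quoted from Table 2 are used; the function and label names are illustrative and are not taken from the manuscript:

    # Per-class sensitivity (recall) computed from one row of a 3-way confusion
    # matrix. Only the fracture-row counts quoted from Table 2 are used here;
    # the remaining rows of the matrix are not reproduced.

    def per_class_sensitivity(row, true_label):
        """Correct predictions for one class divided by all cases of that class."""
        return row[true_label] / sum(row.values())

    # Fracture cases as classified by the 3-way "integrated detector" (Table 2):
    fracture_row = {"normal": 480, "fracture": 2017, "abnormal": 803}

    print(per_class_sensitivity(fracture_row, "fracture"))  # 2017/3300, about 0.611
    print(fracture_row["normal"] / 3300)    # about 0.145 of fractures read as normal
    print(fracture_row["abnormal"] / 3300)  # about 0.243 of fractures read as other abnormality

The same row-wise calculation applied to the "other abnormal" class is what yields the 59.6% sensitivity cited above, in contrast to the 83.6% figure obtained when fractures and other abnormalities are pooled as "not normal."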
Reviewer 3:

Major Strengths: They introduced a fracture detector and an abnormality detector, which was interesting. The annotation, which was done using the reports written by many radiologists to mitigate bias, was also interesting. The HAMIL-Net architecture is insightful.

Major Weaknesses: Dataset splitting and some other factors, such as dataset imbalance, are not clearly discussed.

Specific issues that need to be addressed by the author(s):

1) The prior work discussion needs citations: "The current AI fracture detection models were trained to generate binary labeling (usually also with a numerical score) indicating the presence or absence of fractures given one or more radiographs."

2) "HAMIL-Net performed well in our preliminary studies when trained and tested to detect presence of abnormalities of upper extremities with the MURA data set. This paper reported our evaluation of HAMIL-Net for detecting foot and ankle fractures in the presence of other abnormalities." needs citations.

3) "We did not augment the data by mirroring etc. as in many deep learning computer vision research." needs citations.

4) It would be great if you could add some more details on how the annotation using PyContextNLP is performed.

5) "We then randomly sampled about 20% from each dataset as the holdout set to evaluate the performance of fracture detectors." How was the split performed? Based on patients, or totally at random? If a patient had more than one sample, did you hold out all samples belonging to that patient? (A patient-level split is illustrated in the sketch after this list.)

6) "Each dataset was used either as the training or as the test data" I didn't quite understand how the training and testing were performed. Could you please make this clearer? Was each model trained on a single dataset, or trained on one dataset and tested on another? The first paragraph of the results is clear, but this paragraph is confusing.

7) Did you have a validation set to tune the hyperparameters? How was the hyperparameter tuning performed?

8) "We note that many fracture detection models reported previously had AUCs higher than 0.9 but …" needs citations.

9) There is a class imbalance observed from the confusion matrices in Figure 1 and Tables 2 and 3. Did you perform any steps to address this imbalance?

10) Is the validation set mentioned in the table titles the same as the held-out test set? If so, the held-out test set should not be used for validation. If this is not what the authors intended, it has to be clarified in the text. A validation set is not mentioned in the main text.

11) "FractureDetect, a more advanced upgrade of OsteoDetect from the same company, was tested on detecting fractures in 17 body parts by combining outputs from an ensemble of 10 convolutional neural network (CNN) models. BoneView is also based on CNN and was approved to detect fractures in extremities, ribs, and spine across radiographs. These systems achieved sensitivity above 0.9 and close to 0.9 in specificity on ∼10K test cases." needs citations.
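As an illustration of the patient-level (grouped) holdout split asked about in item 5, a minimal sketch follows. The array names and the use of scikit-learn's GroupShuffleSplit are assumptions for illustration only, not the authors' actual pipeline:

    # Patient-level holdout split: every radiograph of a given patient falls on
    # the same side of the split, preventing leakage between training and test.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    image_ids   = np.array(["img_001", "img_002", "img_003", "img_004", "img_005"])
    labels      = np.array([1, 0, 1, 1, 0])                  # hypothetical fracture labels
    patient_ids = np.array(["p1", "p1", "p2", "p3", "p4"])   # grouping key: one patient, many images

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(image_ids, labels, groups=patient_ids))

    # No patient appears in both subsets.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])

A purely random split at the image level, by contrast, can place different views of the same patient in both the training and the holdout sets, which tends to inflate reported accuracy.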
Reviewer 4:

This is an interesting manuscript. The ability of the AI models to perform well even when there are other abnormalities in addition to fractures is noteworthy. The general readability of the manuscript could be better. A careful review with an eye to simplifying and correcting minor grammatical errors would make the material more accessible to your readers. A few recommendations follow:

Pg. 5, line 10. Foot and ankle fractures are the [most] common military health problem.

Pg. 5, line 33. multiple view. Should be multiple views.

Pg. 7, lines 25-28. Untreated fractures may severely aggravate. This is an odd phrase that seems incomplete. Aggravate what? Consider rewording, e.g., Untreated fractures may have long-term consequences such as . . ..

Pg. 7, line 48. which are cleared with fractures and can return to duty. This wording does not make sense. Generally, service members would not be cleared to return to duty in a combat zone, at least not full duty, without some form of treatment and a profile outlining duty limitations.

Pg. 8, line 19. it is expensive for a large number of training examples needed. Consider rewording, e.g., it is expensive when a large number of training examples are needed.

Pg. 8, line 24. from normal healthy ones. Consider rewording for clarity, e.g., "compared to normal patterns" or "compared to normal images" or "compared to normal controls" or maybe even just "compared to normal."

Pg. 8, line 49. The study team have. Should be The study team has . . ..

Pg. 9, line 4. This paper reported. Should be This paper reports . . ..

Pg. 9, lines 39-42. creation of different pairing of training and test datasets. Should be pairings, plural.

Pg. 10, lines 40-43. Among these studies, 148,278 studies for 513K unique patients were for foot and ankle. Consider rewording for clarity, e.g., Among these studies there were 148,278 foot and ankle studies for 513K unique patients.

Pg. 11, line 7. Change rule-base to rule-based.

Pg. 11, line 12. Delete "of."

Pg. 12, line 4. either had fractures or generally healthy. Change to either had fractures or were generally healthy.

Pg. 12, line 15. Table 1 showed the sizes. Change to present tense, Table 1 shows the sizes . . ..

Pg. 14, lines 4-16. There is an "either" in line 7, but no "or" appears in the rest of the sentence, and it is not obvious which of the several sections you wish to highlight. Please reword and consider breaking the lengthy sentence into shorter ones.

Pg. 14, lines 19-22. Consider rewording for clarity, e.g., Whether radiologists had or had not expressed high confidence when reporting fractures did not make any significant difference when comparing all corresponding pairs (P-values > 0.05).

Pg. 14, line 25. Table 2 showed the confusion matrix. Change to present tense, Table 2 shows the confusion matrix . . ..

Pg. 14, line 35. Consider changing "performed relative stable" to "performed in a relatively stable manner" for clarity and grammar.

Pg. 14, line 50. Recommend changing "that three classes were all important" to "that all three classes were important." All important has a different connotation, which is confusing.

Pg. 14, line 55. Favor seems an odd word choice when referring to a statistical model. Do you mean advantage?

Pg. 15, lines 4-9. While the "certain all normal" model under this labeling also achieved an F-score as high but the model cannot distinguish fracture from other abnormality. Either delete the opening "while" or delete the "but."

Pg. 15, line 17. Table 3 showed the 3-way classification accuracies. Change to present tense, Table 3 shows the 3-way classification accuracies . . ..

Pg. 16, lines 30-37. In the Methods section you state that the x-ray images are of "Veteran patients." There is no mention of service members. Though it is possible for some Regular Army or Guard/Reserve Soldiers to receive care in a VA facility, it is relatively uncommon. Here, in the Results section, you point out that the age group [18-35] matched that of the Service Members (normally not capitalized, especially in journals) better than the older group (35+). I am not sure, but I think you are trying to say that your younger age group is a better match with the age demographics of active service members. If this is correct, consider rewriting this paragraph for clarity. Also, you might want to consider expanding on what this finding means in the Discussion section. For example, what are the ramifications for using these models in DoD vs. VA settings? What might account for the different results for the two aggregate age groups? I suspect that the fact that the Veterans in that group are older and more likely to have other abnormalities and problems is more important than just the fact that there are more Veterans in the older age group of your study population.

Pg. 17, line 12. alone side. Suspect you mean alongside.

Pg. 18, line 14. otherwise prohibitively expensive. Incomplete sentence. Expensive to include in the model? Expensive to differentiate? Expensive to categorize?

Second revision:

Reviewer 2:

I thank the authors for addressing my previous comments. They have changed the emphasis of the manuscript and clearly stated that HAMIL-Net needs more work before it can be used clinically. I appreciate the authors for acknowledging that; however, this manuscript still does not seem to fully address the clinical inadequacy or the complexity of the research that would be required to develop and evaluate a fully automated foot and ankle injury imaging reader.
For example, the authors replied to one of my comments that "the results reported serve as a retrospective assessment of the performance of an AI model rather than an indication of its clinical efficacy." This response gives the impression that the authors' goal was simply to build an AI model and assess its accuracy, in other words an academic exercise. If that is the case, why submit the manuscript to a medical journal focused on the needs of the military rather than a journal focused on AI? If the goal is to evaluate and possibly improve HAMIL-Net's ability to correctly interpret images of orthopedic injuries, then diagnostic accuracy is of paramount importance and not just a method for evaluating the accuracy of AI.

The authors also did not seem to consider any clinical context that might have improved the focus of their study and possibly the diagnostic accuracy of their model. For example, they responded to one of my comments that 54% of the false negative fracture cases were due to remote fractures that were healing or healed. Why were these included in the study if the potential use for HAMIL-Net in a military context is triage during mass casualty events, as the authors stated in their conclusion? From this perspective, the false negative rate may be too high due to including cases that are irrelevant to an acute care setting. Understanding this context and tailoring the study methods to provide an appropriate assessment that is consistent with that context is vital to providing valid data for scientific decision making.

There are other ways the authors pointed out that this study is not realistic. In the last paragraph of the discussion, they indicate that future research should use two output labels, "fracture" and "abnormalities," rather than just one output, "fracture." Until this is done, HAMIL-Net is nowhere near being useful in military clinical settings. Furthermore, they acknowledge that their methods "did not faithfully reflect the true condition where both fracture and abnormality may present at the same time." They also acknowledge that their future research would need to account for "views and anatomic regions." Why promote HAMIL-Net to the military without any of this? In its current form, this manuscript says nothing about HAMIL-Net's potential value to the military.

In addition to the lack of clinical context, the sample that the authors used is not representative of the military population. Why use data from people who are not reflective of the military population when trying to promote HAMIL-Net to the military? The authors reported in Supplemental Table 1 that 29.43% of their sample is aged 18-35. They also indicated in a response to one of my comments that 84% of their false positive cases were seniors! They also indicated that 21.5% of false positive cases were due to degenerative changes and 14.5% were due to post-operative changes. This is not consistent with a military population or a military mass casualty scenario. This could have resulted in an inflated false positive rate. All of these issues do not give me any confidence in the validity of the authors' results or that they are relevant to the military.

Reviewer 3:

Major Strengths:

Major Weaknesses:

Specific issues that need to be addressed by author(s):

Reviewer 4:

The revised manuscript is much improved and the authors have thoughtfully addressed the reviewers' concerns, in my opinion.

Pg. 16, line 55. Should be datasets, i.e., plural. ". . . tested it with the holdout set of one of the datasets and reported its performance."

Pg. 21, line 4. The word aloneside should be alongside.

Pg. 23, line 30. There is a word missing in the sentence. "The results reported here serve as a retrospective assessment of the performance [of] an AI model . . .."