COMMENTS TO THE AUTHOR:

Reviewer #2: I have very mixed feelings about this manuscript. The application of machine learning to the topic of violence against women is intriguing, and I can see that it holds potential for yielding some interesting insights for the field. But the framing of this manuscript leaves me unsatisfied. It neither fully explores the potential these methods could have for the field of violence nor offers enough explanation to help those unfamiliar with the methods to understand them better. Nor does it really offer new insight via the particular analysis of the NFHS in India that the authors pursue.

It seems to me that the utility of these methods would be to apply them to questions where accurate prediction of an outcome -- say, what best predicts when a woman experiencing domestic violence is at high risk for intimate homicide -- could be relevant for programming or clinical intervention. In such a case, finding unexpected but highly predictive variables could help inform a screening protocol to help identify women at risk. Another example would be using large data sets from social media to predict which young adolescents might be at risk of suicide. Admittedly, appropriate data may not exist to conduct the femicide analysis, but I would be interested in the authors' ideas on where and how such methods could be usefully applied.

Applying the technique to cross-sectional data from India does not seem to maximize the potential of the method. It is unlikely to yield insights that are either actionable for practice or different from those derived from more traditional techniques. Because the data are cross-sectional, they also cannot be used to explore potentially causal connections. This leaves me underwhelmed with the substantive findings of this paper.

Also, the usefulness of the findings seems undermined by the reality that less than 2% of respondents acknowledged forced intercourse or other unwanted sexual acts.
We know from other studies from India and elsewhere that this is a gross underestimation of the true level. Further, the fact that 47% of young women disclosed the incident (which is FAR higher than is typical) suggests that the 2% captured here represent only the most severe cases. In my mind, this level of misclassification brings the outcomes of the exercise into question.

I can, however, imagine a way that the manuscript could be reworked to make it more useful: as a methods piece that discusses the potential of machine learning in greater depth and provides more detail on both the methods and the types of data sets and questions that are most amenable to these techniques. It could use the India analysis as a worked example of its application -- but here, the India analysis would be a case study of the method, rather than the "subject" of the manuscript's findings and contribution. Personally, I think this would make a larger contribution to the literature and to violence prevention.

Reviewer #3: I appreciate the opportunity to review "Machine Learning Analysis of Non-marital Sexual Violence in India." First, the use of machine learning does have the potential for advancing knowledge in the area of gender-based violence; however, I am not convinced by the manuscript that this is the appropriate use of machine learning in the field of violence. The findings do not provide the field with new learnings but reinforce what we know, and it is unclear how a government, organization or provider would use this information to inform services and programs. The rationale for using machine learning in this study appears to be access to a large data set on an underreported issue: sexual violence among young women aged 15-19 years who were not married at the time of the interview.
The study would be strengthened by identifying the research questions that necessitate machine learning; a rationale for why this approach is required to advance our understanding of individual, family, community and societal factors associated with SV among young women would be helpful (given the limitations of cross-sectional design, self-report, etc.). Additionally, a detailed description of what machine learning is and how its use informs violence prevention and response policy, practice and research would be useful.

The authors do note underreporting of SV as an issue, with a little more than 1% of the over 13,000 women in the database reporting SV. But given that almost 50% of the women who experienced SV did report it to someone (which is actually relatively high), I suspect this is related to the severity/frequency of the violence these young women are experiencing, which introduces other biases and challenges in interpreting the data and findings.

The vast majority of young women knew the perpetrator, which is not surprising: around 40% of young women report SV by a family member and 17% by a partner, with only 8% reporting SV by a stranger. Therefore, for example, the theme titled "mobility/freedom" is important to discuss as a safety strategy (escaping an abusive relationship, family or community, and accessing resources to live outside an abusive family, etc.) rather than as an employment opportunity, given the reality of the young women's lives (majority living in rural areas, poverty, etc.). Women seeking safety can also be vulnerable to other forms of abuse/violence, such as exchange of sex for safety, housing, food, etc.

I can see a use of machine learning in the field for identifying modifiable variables that predict a rare event, such as femicide.
For example, a factor such as a perpetrator's access to a weapon or gun could lead to policies that focus on removing weapons and guns from violent partners, and alcohol/drug use could lead to a focus on clinical training and behavioral health programs for harm reduction.

Reviewer #4: In this study, the authors apply machine learning methods to nationally-representative retrospective cross-sectional data of women aged 15-49 years in India, to identify variables related to NMSV.

Comments:

The study considers women only - can the authors please clarify this within the title and abstract of the paper?

"Data were drawn from the National Family Health Survey (NFHS-4), a nationally-representative, household-based survey, conducted during 2015-16 in India." These data were collected quite some time ago now. How do the authors envisage that changes over the last 5 years may affect the study findings?

"The sampling strategy for this survey has been described elsewhere.1" Can the authors please describe the sampling strategy in brief here, as it is relevant for correct interpretation of study limitations?

"NFHS-4 interviewed women of age 15-49 years old on aspects related to their health, with a focus on maternal, child and reproductive health, access to health services and health providers, and household characteristics related to social and economic status. A sub-sample of women was also interviewed on their experiences of violence, as well as dimensions related to their agency and empowerment." How was the subsample selected from the larger sample? Is this subsample still considered nationally representative, or may there be causes of bias due to the subsample selection strategy?

Table 1 compares the sample with the NMSV subsample - can the authors add statistical tests to compare these groups for any significant differences?

"In addition to the regularized logistic regression models, we also used artificial neural network models to account for non-linear relationships among the predictors".
The approaches used by the authors are thorough and rigorous.

"The analysis included all variables from the NFHS-4 dataset, as the study aimed to explore potential correlates of NMSV from a large dataset with information on diverse aspects related to the women ... in total, each woman was represented as a set of over 6,500 variables that summarized her characteristics". The regularisation of the regression models helps to overcome the risk of over-fitting, given the large number of variables. Can the authors please discuss if this is also the case for Neural Networks?

"As with a typical supervised machine learning task, we first split the dataset into training and test data, with 20% of the data randomly assigned as test dataset. (See Appendix A for details.)." The authors have appropriately split their data into test and train subsets for model development. Of note, no external validation of the model has been presented within this report.

"We then conducted our models and evaluated thems using two metrics: Balanced Error Rate (BER), and Area Under Curve of Receiver Operating Characteristic (AUC)". The use of BER and AUC to assess model performance is appropriate in this context. Please note the typo to be corrected here.

"This strategy was based on the fundamentals of qualitative coding of information, where domain experts independently review text to generate and code relevant themes.24 Two experts independently coded the text, inter-reliability was tested (>90%), and then coders met to reach consensus for any codes in dispute. A group of variables was identified as a theme when the number of variables within was at least 5% of the total number of identified variables above the knee point of the coefficient curve. A single variable could be included in multiple themes." Inter-rater reliability is high, which provides confidence. Were the two reviewers of equal experience and expertise in this area?
Might having a third independent reviewer help to mitigate any potential bias?

Table 1: States that the NMSV sample is 208. This seems small for ML techniques. There also appears to be some discrepancy, as previously in the article the authors state that "The current analysis used this sample of women who responded to questions related to experiences of violence, and who had not ever been married at the time of interview ("never married") (N=13,627). The complete analysis was also repeated for a sub-sample of these women aged 15 to 19 years old (N=8,007), to identify correlates of NMSV specific to adolescent girls". Can the authors please clarify the sample sizes accordingly? Furthermore, how large was the subsample of adolescents within this? In addition to my earlier comment regarding statistical testing to be provided in Table 1 to compare NMSV and the larger sample, can demographics for the adolescent subsample also be included in Table 1, along with further statistical testing to better understand how representative the subsample is of the original nationally representative cohort?

Tables 3 and 5: Can the authors please provide coefficient information for each of the predictors listed?

Overall, this is a well-written article, presenting a valid and comprehensive methodological approach for the clearly stated research aims. The main limitations have been acknowledged in the discussion section.

TECHNICAL INFORMATION:

When you submit the revised paper, please provide the following:

One "clean" copy of your manuscript
One copy where your changes are highlighted (tracked changes)
A separate, point-by-point response to the editorial and referee comments, clearly indicating where in the manuscript changes have been made
Any Figures, Tables, and supplementary files (even if no revisions have been made)

Please do NOT include a copy of your original manuscript. All text files should be supplied as MS Word files.
To submit your revised manuscript, please visit EClinicalMedicine's Online Submission and Peer Review Website at https://www.editorialmanager.com/ECLINM/ and enter your username and password.

Your username is: anitaraj@ucsd.edu

If you need to retrieve password details, please use the link to reset your password.

You will see a menu item called 'Submission Needing Revision'. You will find your submission record there.

An important part of EClinicalMedicine's peer review process is the assessment of author contributions and any potential conflicts of interest. If you have not yet supplied your signed statements, please do so now. The editors may use such information as a basis for editorial decisions and will publish such disclosures if they are believed to be important to readers in judging the manuscript.

In summary, the signed statements we require are:

Authors' contribution and signatures
Signed conflict of interest statements for ALL authors

Please also check whether you need to provide the following:

Signed copyright permissions for previously published material
Signed consent from individuals cited in the Acknowledgements
Signed consent for use of cited personal communications
Signed patient's consent and permission to publish (if not already submitted)

I look forward to receiving your revised manuscript.