Editor and Reviewer comments:

Reviewer #2: This manuscript uses machine learning (ML) to predict instances of child marriage. It combines statistical and qualitative analysis to identify which variables are relevant predictors. Although the manuscript addresses a relevant substantive question, it has some severe limitations that I have listed below.

1. It is unclear why the three ML algorithms were selected. There are at least 200 ML algorithms (see e.g. the caret package in R). Please provide reasons for your choice. Relatedly, for general prediction, why not simply use an ensemble, which tends to perform better?

2. LASSO and Ridge do not capture interactions between variables. Although a CNN does capture interactions, it may need more data (more children) to do so. Since you use one-hot encoding, there will be many interactions hidden in the data that the LASSO and Ridge will miss. Random forests, however, tend to do much better in finite samples and capture interactions better in these settings. I recommend testing one (see the first sketch after these comments).

3. Variable importance (feature selection) is not appropriate for LASSO and Ridge. The regularizing penalty will arbitrarily (rather than systematically) push variable coefficients toward zero across different initializations of the model; see e.g. Mullainathan and Spiess (2017), p. 97. For interpretation, I recommend testing a random forest instead.

4. It is unclear why you focus specifically on 20-24-year-olds, and not some other "more recent" group, say 18 to 20, or a "less recent" one, say age 25 and nearby. Please provide substantive arguments and test the robustness of your results with a larger group. The international standard for prevalence estimates of child marriage is based on its occurrence among 20-24-year-olds, because this age range offers sufficient time for a marriage to have occurred while still being young enough to reflect more current marital practices.

5. What were the average and spread of the age at child marriage?

6. One-hot encoding is likely to produce many low-signal variables, and different representations of the data set will likely produce different results. See e.g. the following article: https://arxiv.org/abs/1908.09874v1

7. The large number of variables in your dataset is a direct consequence of using one-hot encoding. In other words, the entire procedure could have been shortened or avoided had another encoding been used (e.g. keeping the original variable values). Please show the robustness of your procedure under another encoding (see the encoding sketch after these comments).

8. How was the traditional logistic regression estimated? It is unclear whether you used cross-validation there too, or just a single estimation on the entire training set.

9. Figure 1: the approach to identifying predictors looks arbitrary. (1) Within the paradigm of deep learning (i.e. neural networks), feature selection is done automatically, so for this algorithm step one is not helpful and likely produces suboptimal results. (2) For the reasons listed above, the features selected by a LASSO are likely arbitrary altogether: among variables with similar levels of correlation with each other and with the outcome (as well as with other covariates), the penalty will arbitrarily suppress parameters. See Hastie et al., The Elements of Statistical Learning.

10. It is unclear how the knee point is defined. A novice ML reader (a large portion of the SSM readership) will not understand your definition (one common definition is sketched after these comments).

11. Figure 2: Again, for the reasons listed above, I am not at all convinced that iterating between L1 and L2 is useful for identifying predictors. Neural networks and tree/forest algorithms accomplish this automatically, implicitly, and systematically, so the workflow seems arbitrary and without support in the ML pipeline. Nonetheless, I do like the fact that the authors have a dedicated qualitative step to validate and interpret the variables. My point is that this should be done after the appropriate algorithms and pipeline have been run, i.e. in the post-estimation phase. As it stands, there is too much human interference in the machine-learning estimation stage.

12. Although the authors have a dedicated qualitative step and have put a lot of effort into it (two coders), that part does not shine through in the work: the reader is presented with only a few paragraphs. I recommend building that part out more; it is what makes this manuscript unique.

13. The religion variable is likely highly predictive. The coding of Muslim is fine, but "Hindu and others" is too broad a category. I would like to see how the model is affected by a more granular categorization.

14. It seems to me that the caste variable suffers from a similar limitation as the religion variable does. Please separate its categories.

15. It is written, "While still relatively new in terms of use in public health, we can find its application in the areas of environmental health, physical health, and cognitive health. 23,26,27". I would mention some recent publications in SSM: Daoud et al., "Predicting women's height from their socioeconomic status: A machine learning approach", and Seligman, "Machine Learning Approaches to the Social Determinants of Health".
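To make comments 2 and 3 concrete, here is a minimal sketch of the comparison I have in mind. It assumes scikit-learn and uses synthetic data as a stand-in for the authors' one-hot-encoded survey features; the hyperparameter values are illustrative, not tuned.

```python
# Sketch for comments 2-3: compare LASSO, Ridge, and a random forest under a
# common cross-validation protocol. make_classification is a synthetic
# placeholder for the survey data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "lasso": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "ridge": LogisticRegression(penalty="l2", solver="liblinear", C=1.0),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```

An ensemble (comment 1) could be dropped into the same loop with no other change to the protocol.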
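Similarly, for comments 6 and 7, the encoding-robustness check could look like the sketch below. The DataFrame, its column names, and the random outcome are hypothetical placeholders, not the authors' data.

```python
# Sketch for comments 6-7: fit the same model under one-hot and ordinal
# (original-value) encodings and compare the cross-validated results.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({                      # hypothetical survey-like columns
    "religion": rng.choice(["muslim", "hindu", "christian", "sikh"], n),
    "caste": rng.choice(["sc", "st", "obc", "general"], n),
    "wealth_quintile": rng.integers(1, 6, n),
})
y = rng.integers(0, 2, n)                # placeholder outcome
cat_cols = ["religion", "caste"]

for name, enc in [("one-hot", OneHotEncoder(handle_unknown="ignore")),
                  ("ordinal", OrdinalEncoder())]:
    pre = ColumnTransformer([("cat", enc, cat_cols)], remainder="passthrough")
    pipe = make_pipeline(pre, RandomForestClassifier(n_estimators=200,
                                                     random_state=0))
    auc = cross_val_score(pipe, df, y, cv=5, scoring="roc_auc").mean()
    print(f"{name} encoding: mean AUC = {auc:.3f}")
```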
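Finally, for comment 10, one common and easily explained definition of the knee point is the point of maximum distance from the chord joining the curve's endpoints (as in the Kneedle algorithm). A sketch on a synthetic curve:

```python
# Sketch for comment 10: knee of a curve (e.g. number of features kept vs.
# cross-validated score) as the point farthest from the endpoint chord.
import numpy as np

def knee_point(x, y):
    """Index of the point with the largest distance to the endpoint chord."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Normalize both axes to [0, 1] so the distance is scale-free.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    pts = np.column_stack([xn, yn])
    chord = pts[-1] - pts[0]
    chord = chord / np.linalg.norm(chord)
    vecs = pts - pts[0]
    proj = np.outer(vecs @ chord, chord)          # projections onto the chord
    dists = np.linalg.norm(vecs - proj, axis=1)   # perpendicular distances
    return int(np.argmax(dists))

x = np.arange(1, 101)            # e.g. number of features retained
y = 1 - np.exp(-x / 15)          # e.g. a saturating validation score
print("knee at x =", x[knee_point(x, y)])
```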
Reviewer #4: This paper uses a machine learning-based hypothesis generation approach to identify variables associated with child marriage in India, based on data from a national survey. While such approaches are commonplace in other disciplines, their use for this particular question is quite innovative. In addition, the paper introduces a novel iterative categorization approach that uses domain expertise in an iterative fashion to identify themes connecting modestly associated features, which would be missed by purely quantitative approaches due to the presence of stronger predictors. While this is a novel and important contribution, there are some methodological concerns that would benefit from clarification (see comments). Using these methodologies, the paper identifies two new themes: (1) non-utilization of health system benefits, and (2) exposure to media. The paper is clearly written and easy to follow. One major issue relates to the actionability of the findings (see comment #1). Additional minor points of correction/clarification are listed below:

1. The motivation for this work is the identification of previously unknown correlates of child marriage in India. However, it is unclear whether the newly identified themes (and features) are actually actionable in terms of interventions. First, these features (and much of the survey itself) reflect life after child marriage rather than before; it is not clear how interventions can be devised to address issues such as limited exposure to media. Second (and more subtly), the iterative categorization approach is no doubt valuable but tends to select not-so-predictive features (as can be seen in Fig. 3b). It would be beneficial to readers if these caveats were addressed in the discussion and some strategies for devising interventions were detailed.

2. I am slightly concerned about how the iterative categorization was implemented to ensure no information leaks. I assume that in Fig. 2 the entire cross-validation + holdout test set protocol was retained? If not, I worry that either the holdout set or the entire data set was used repeatedly, which could lead to inflated maximum likelihoods and incorrect features/themes (see the protocol sketch after these comments).

3. The paper is sparse on some machine learning-specific details. How were the hyperparameters (number of epochs, training algorithm, whether the model was trained in stochastic, batch, or mini-batch mode) chosen for the neural network? What were their values? Why the choice of four hidden layers?

4. At what score threshold is the BER calculated? Does it use the optimal value, or is it arbitrarily set to a value such as 0.5? The former is perhaps the more appropriate way to implement this (see the BER sketch after these comments).

5. On page 6, line 3, typo: "th".

6. On page 9, a brief description of the pre-processing steps would be useful. What were the dataset-specific pitfalls and errors that commonly needed to be addressed?

7. On page 15: AUC and accuracy are distinctly calculated measures in machine learning. At the risk of sounding pedantic, it would be better to clarify that AUC can be thought of as an indicator of, but not a synonym for, accuracy.

8. On page 20, it is not clear how the claim that neural networks are not amenable to iterative theme generation can be made. If I understand the methods correctly, a permutation approach is used to calculate feature importance in terms of mean squared error. Why could one not simply replace the coefficient-based approach in Fig. 2 with this permutation approach for the neural network (see the permutation sketch after these comments)? Granted, perturbing individual features may have more complex downstream effects in a neural network, but sensitivity analyses are common approaches to explainability in neural networks.
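To illustrate comment 2, here is a minimal sketch of the leakage-safe protocol I am assuming: all tuning and feature selection happen via cross-validation inside the training split only, and the holdout set is scored exactly once at the end. scikit-learn is assumed; the data and grid values are synthetic placeholders.

```python
# Sketch for comment 2: holdout set untouched until the single final evaluation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=3000, n_features=40, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
)
search.fit(X_tr, y_tr)   # tuning/selection never sees the holdout set
auc = roc_auc_score(y_ho, search.predict_proba(X_ho)[:, 1])
print(f"holdout AUC, evaluated once: {auc:.3f}")
```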
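For comment 4, the sketch below shows, on synthetic labels and scores, how the BER varies with the threshold and how an optimal threshold can be chosen rather than defaulting to 0.5.

```python
# Sketch for comment 4: balanced error rate (BER) as a function of threshold.
import numpy as np

def ber(y_true, y_pred):
    """Balanced error rate: mean of false-negative and false-positive rates."""
    fnr = np.mean(y_pred[y_true == 1] == 0)
    fpr = np.mean(y_pred[y_true == 0] == 1)
    return (fnr + fpr) / 2

rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.15).astype(int)            # imbalanced outcome
scores = np.clip(rng.normal(0.30 + 0.30 * y_true, 0.15), 0, 1)  # fake scores

thresholds = np.linspace(0.01, 0.99, 99)
bers = np.array([ber(y_true, (scores >= t).astype(int)) for t in thresholds])
best = thresholds[bers.argmin()]
print(f"BER at 0.50: {ber(y_true, (scores >= 0.5).astype(int)):.3f}")
print(f"BER at {best:.2f}: {bers.min():.3f}")
```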
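For comment 8, the permutation approach could be implemented along these lines; MLPClassifier with four hidden layers is only an assumed stand-in for the authors' network, and the data are synthetic.

```python
# Sketch for comment 8: permutation importance for a neural network, which
# could replace the coefficient-based step of Fig. 2.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(64, 32, 16, 8), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)
# Shuffle one feature at a time on the test set; record the drop in AUC.
res = permutation_importance(net, X_te, y_te, scoring="roc_auc",
                             n_repeats=20, random_state=0)
for i in res.importances_mean.argsort()[::-1][:10]:
    print(f"feature {i}: mean AUC drop = {res.importances_mean[i]:.4f}")
```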