1. Review for "Global mesoscale ocean variability from multi-year altimetry: an analysis of the influencing factors"

Summary

The manuscript presents and compares a couple of different statistical methods for characterizing the response of sea surface slope (SSS) to different oceanic features (environmental parameters). The basis for this work is the demonstrated strong correlation between mesoscale ocean variability and sea-floor roughness (Gille et al. 2000). In this paper the authors expand the parameter space beyond sea-floor roughness alone, identify 27 parameters which could potentially impact the variability of SSS (which in turn can be related to oceanic eddy kinetic energy), and explore two different regression models (linear regression and boosted trees) to predict SSS from these input parameters. In my opinion, some of these 27 input parameters could be grouped together, and some might be redundant as input features. Gradient-boosted trees perform better overall, but fake, unphysical inputs also appear to confuse the models, demonstrating some amount of statistical variance associated with these models. The training strategy of splitting the data into 64 geographical bins and using the input from separate regions to make predictions over a specific region is actually quite clever and ensures model generalizability. Of course, there are various ways to assess model performance and dependence on input parameters, and the authors present a very nice summary of the parameter dependence by quantifying the rank of each input feature using a couple of methods for each type of model. I felt this could be better represented as a figure instead of a table. The relative feature importance is a major component of the paper, yet there is no coherent discussion of it, only a couple of tables listing the ranks. It is only a suggestion, but the authors could try to think of a more meaningful, succinct representation of the key results. The figures in general are too small, and it is difficult to see the details.

I enjoyed reading the paper overall. However, I felt it can be shortened quite a bit: some of the text is repeated in places, and some of the methods/analyses are not motivated enough. I therefore recommend minor (but important) revisions. I have listed my detailed comments about specific sections below.

Detailed Comments

Line 202: "...scale the absolute latitude range from 0 to 2": Could you explain the motivation for this, instead of just taking the sine of the latitude (which would also give you the Coriolis parameter and scale it from 0 to 1)?

Line 203: "robustly scale each feature to the interquartile range": Please explain what is meant by this.
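If this refers to the standard robust-scaling recipe (as in, e.g., scikit-learn's RobustScaler), one sentence saying so would suffice. For reference, a minimal sketch of what I assume is meant (variable names are placeholders):

    # Robust scaling: center on the median, divide by the interquartile range,
    # so outliers influence the scaling far less than with mean/std scaling.
    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.random.lognormal(size=(1000, 3))   # stand-in for a skewed feature matrix
    X_scaled = RobustScaler().fit_transform(X)

    # Equivalent by hand:
    q25, q50, q75 = np.percentile(X, [25, 50, 75], axis=0)
    X_manual = (X - q50) / (q75 - q25)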
Lines 189-412: Can you say a few words about how the various inputs are scaled, and the motivation/reasoning for the chosen approach?

Lines 207-268: Features 1-8 are all in some way a measure of the ocean bathymetry. Did you try to see whether all of these features can be condensed into fewer input features by applying some transformation? There exist physical relationships between some of these input features. Can you check whether the statistical models can learn these physical relationships without being given all of them explicitly? E.g., do the feature-selection test but restrict it to this group of features: how much incremental performance do you gain when you add smoothed sea-floor roughness on top of sea-floor roughness alone? Can you get the statistical model to learn the depth slope from the depth, in which case can you get rid of the latter as an input? These kinds of tests would be useful for coming up with a practical predictive model.

Lines 270-277: Same question here: do you need both the quantity and its gradient?

Lines 300, 323: Can you please explain why you use the annual-mean T and S for N^2 but the monthly-mean MLD? The annual-mean stratification is meaningless when you have a seasonally varying mixed layer. For the summer months when the thermocline punches through the mixed layer, you will have deep mixed layers with strong stratification, which doesn't make physical sense.

Lines 327-335: Can you explain the physical motivation for splitting this into two variables? Can't you just use sin(lat), which would give you both the sign and the latitude? I would encourage the authors to take a look at Sinha and Abernathey (2021), where there is a discussion of using spatial information in geophysical data as input parameters for statistical modeling.

Line 352: Feature (26): Can you explain the motivation for using this as a parameter? If I understand it correctly, this is the depth from the base of the mixed layer to the ocean bottom. So if you already have bathymetry and MLD, isn't this just a linear combination of the two?

Line 412: "...assume that each prediction has the same uncertainty": Can you explain the physical motivation for this assumption? Also, have you looked at the distributions of the input variables and of the target? That might tell you whether you need to apply any transformations to the data (either the features or the labels you are trying to predict).

Lines 428-437, 539: I like this; this is nice! It ensures that the model is not just learning local information and using that for prediction. But I have a question: when you make the predictions from the 44 different blocks, average those predictions for each box, and then do the concatenation, do you not see blockiness/discontinuities at the block edges? I assume the predictions for each block are independent, so there are bound to be discontinuities even with averaging. Can you comment on that? It's hard to see anything from the figures.

Lines 442-466: This is a nice approach for determining relative feature importance, but I would like to see it explored a little more in the results section, perhaps with a figure that summarizes the findings in a succinct way. I refer you again to Sinha and Abernathey (2021), where they did a similar feature-dependence study. You could also group together some of the physically related input features and look at the parameter sensitivity within each group. That would be an interesting finding, and it would tell us about the information criteria for training a statistical model. You can also assess sensitivity to perturbations in the inputs, as in Sinha and Abernathey. One way to set up the group-wise test is sketched below.
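A minimal sketch of such a group-wise ablation (the feature names and groupings here are hypothetical stand-ins for the paper's 27 inputs, and I assume a pandas/scikit-learn workflow):

    # Drop one physically related group of features at a time, retrain,
    # and record how much R2 is lost relative to the full feature set.
    import lightgbm as lgb
    from sklearn.model_selection import cross_val_score

    groups = {  # hypothetical groupings of the 27 inputs
        "bathymetry": ["depth", "roughness", "smoothed_roughness", "depth_slope"],
        "stratification": ["N2", "mld"],
        "position": ["abs_lat", "hemisphere"],
    }
    base_r2 = cross_val_score(lgb.LGBMRegressor(), X, y, cv=5, scoring="r2").mean()
    for name, cols in groups.items():
        r2 = cross_val_score(lgb.LGBMRegressor(), X.drop(columns=cols), y,
                             cv=5, scoring="r2").mean()
        print(f"without {name}: R2 change = {r2 - base_r2:+.3f}")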
General comment about the figures: Can you make them bigger and use a different colormap than "viridis"? It is hard to make out any mesoscale features in these figures with this colormap.

Table 1 label (Line 610): "higher importance."

Lines 619-621: It is not clear to me what is meant by this sentence. Can you please rephrase it?

Lines 702-703: Do you think this could have something to do with the fact that you are using annual-mean stratification and monthly-mean MLD?

Line 704: "The Mascarene Ridge is ...": This feels like an abrupt change of topic and does not flow with what is discussed immediately before. Start a new paragraph and perhaps mention why you are bringing this up; only when one gets to Line 715 does it become clear why this is important for the discussion.

Lines 723-726: Maybe you can include some more citations here, e.g., Partee et al. (2021), Sonnewald et al. (2021), Sinha and Abernathey (2021), etc.

Line 734: You don't have to do it for this paper, but perhaps you could also mention how deep learning / ANNs could be a potential class of models to explore, or conversely discuss why they might not be useful/meaningful for this application.

Lines 737-745: This again feels slightly disjointed from the discussion that precedes it. Maybe discuss why/how this affects the outcome of the present study, to connect it to the rest of the discussion.

Lines 772-773: This is a nice statement and I personally agree, but maybe it is not the best note to end on. Perhaps you could discuss how this study is a step in the direction of doing better physics-informed ML.

2. This manuscript attempts to build a data-driven model for sea surface slope (SSS) variability, represented through the median of along-track SSS values derived from satellite altimeters within a region of space (here a 7 arcmin by 5 arcmin box), as a function of a large class of other remotely sensed and climatologically derived physical quantities. The primary ML model is a LightGBM model, with simple linear regression (LR) as a baseline; the former achieves an R2 of 56.6% globally over the LR accuracy of 36%. The spatial maps show that a small number of regions have high errors, which might be the cause of the low predictability, though this isn't entirely clear. Still, the results are interesting, though rather non-trivial, especially as some of the key drivers remain unexplained.

Overall I have mixed feelings about this work. I really like a lot of the careful data curation that has been done here (though some of the choices and their impact on the modeling are unclear). For example, the SSS variability maps for the 30-100 scales are very interesting in themselves. Normally this is the hardest part, with the feature selection/engineering and modeling being secondary. While some of the feature selection is motivated, other choices seem poorly motivated (which might simply be my lack of understanding). This is not in itself a concern, because the kitchen-sink approach can be quite successful in ML. However, I have a few issues with the cross-validation (CV) approach used in this work, possibly because I don't quite get it, presumably because it is written in a confusing fashion (detailed below). Overall, I think this is nice work, worth publishing, but I will leave it up to the authors to decide whether they think major changes are warranted based on my questions below (since these could simply stem from my not understanding the authors' approach). It is possible that they can be addressed by minor changes in the writing.

• Cross-validation: First, for this kind of tabular-data regression problem (i.e., the input covariates have no spatio-temporal structure and the modeling framework is entirely local, here at a 7'x5' box), the primary step is normally to completely mix the data (so if you create a pandas DataFrame of the entire dataset, you would do df = df.sample(frac=1, random_state=seed).reset_index(drop=True)) and then apply 5- or 10-fold CV. Note that because the data is uniformly mixed, the test and train data for each fold of the CV would likely have the same distribution (especially here, where the input and target values are known in every region of the globe). This training is therefore in-distribution but out-of-sample: you are essentially building a model that identifies nonlinear relationships across the entire dataset. A sketch of this baseline follows.
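Concretely (a sketch only; df, feature_cols, and the target column name are placeholders):

    # In-distribution, out-of-sample baseline: shuffle globally, then plain 5-fold CV.
    import lightgbm as lgb
    from sklearn.model_selection import KFold, cross_val_score

    df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # mix the data
    X, y = df[feature_cols], df["sss_variability"]                 # placeholder names
    scores = cross_val_score(lgb.LGBMRegressor(), X, y,
                             cv=KFold(n_splits=5), scoring="r2")
    print(scores.mean())  # high R2 here => the inputs suffice in-distribution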
If the CV accuracy metric, say R2, is high (in the 80s or 90s), then you can claim that you have identified the right input variables to model your target globally, here the SSS variability. After this step one might design more complex CV strategies which no longer mix the data and which can lead to out-of-distribution (OOD) issues (and correspondingly lower R2 values), because even with a high CV score on the uniformly mixed dataset there is no guarantee that a model learned on one region applies to another. (Note: I initially missed that the authors discuss this in lines 715-726, but the key point is actually missing there: distinguishing between being OOD and missing input variables requires the different strategies outlined above and below. With the current strategy, the authors cannot distinguish between OOD problems and the insufficiency of the chosen input variables for modeling.) For example, you might train on some ocean basins and test on others. The authors instead apply a very detailed CV approach where they break the globe up into 64 blocks and use each block in turn as test and the remaining as train (though there is another sampling approach within this; more on that below) (lines 425-440). This is essentially a 64-fold CV approach without mixing the data uniformly, and is a great strategy! However, in lines 454-457 the authors claim to use a 5-fold CV approach: how are the folds chosen? I am also confused as to how these two different CV approaches are used together, and on which CV strategy the reported accuracy is based.

• While I think this is an unnecessarily complicated CV approach for plain old LR, we can move on to LGBM, which is really where these strategies make a difference (lines 480-487). Here they say that "44 out of the remaining 63 blocks as training datasets, and the other 19 blocks as validation datasets". This is perfectly fine and desirable, but do the authors use the 30-times resampling strategy here too, meaning the authors trained 30x63 LGBM models? If so, that seems a bit overkill. Also, did the authors check the variability within each of the 30 resamplings? This variance can actually be used to quantify the model's epistemic uncertainty: since you average the predictions of the 30 models per block, the variance of the predictions can be a measure of model uncertainty at each data point. Finally, the sampling strategy within their 64-fold blockwise CV approach is an interesting one, but it is not motivated enough and seems to skip some steps. First, why haven't the authors tried a proper 64-fold CV without homogenizing the dataset (to ascertain OOD issues), i.e., each block in turn as test and the remaining 63 blocks as train? Then you can try sampling strategies like the one the authors chose (also, why 30 times?). If the authors tried all these initial steps before arriving at their present strategy, then they should write an appendix describing what they tried and why they arrived at the present approach. The chronology of research is irrelevant in many scientific fields, but ML is a primarily experimental field, and reporting this can be very useful for future researchers. A sketch of what I have in mind follows.
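A sketch of the blockwise CV and the ensemble-variance estimate (all names are placeholders; the 44-of-63 block subsampling is elided here, and only the random seed varies across the 30 resamples):

    # Leave-one-block-out CV over the 64 geographic blocks, plus an
    # ensemble-spread estimate of epistemic uncertainty per data point.
    import numpy as np
    import lightgbm as lgb
    from sklearn.model_selection import LeaveOneGroupOut

    logo = LeaveOneGroupOut()  # block_id: one entry in {0, ..., 63} per data point
    for train_idx, test_idx in logo.split(X, y, groups=block_id):
        preds = []
        for seed in range(30):  # stand-in for the authors' 30x resampling
            model = lgb.LGBMRegressor(random_state=seed)
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            preds.append(model.predict(X.iloc[test_idx]))
        preds = np.stack(preds)            # shape: (30, n_test_points)
        y_hat = preds.mean(axis=0)         # averaged prediction per point
        epistemic_std = preds.std(axis=0)  # spread across resamples = uncertainty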
• Feature importance (FA) (ignoring LR again): The authors have two FA approaches: one is the built-in LGBM approach, which measures the total gain of the splits that use the feature, and the second is an ablation approach. The former is not actually that transparent, which is presumably why the authors have added only a single line about it; why, then, do they consider it their primary approach for ascertaining FA? Also, for both approaches you can quantify the FA instead of simply ranking it, so why have the authors not chosen to show that? With ablation you can show the R2 as a function of the added features, so we know the FA quantitatively rather than only through a rank; LGBM likewise provides feature-importance metrics which can easily be plotted (see the sketch at the end of these comments). Also, how much of the R2 do, say, the top 7 features in the ablation FA manage to achieve? I assume the authors already have this information and should really report it, because if it is close to 56% then you already have the primary features.

• The authors note and discuss a few specific regions which have pretty high prediction error. These regions occupy a small fraction of the global ocean, so the authors should report the R2 when simply excluding them (note: no retraining or anything, just the R2 calculation over the remaining regions). This will help readers ascertain how much these outlier regions are skewing the model accuracy.

• Given that the authors repeatedly use quantiles, including the median for their SSS variability, why do they use L2 losses as error metrics (equivalent to maximum likelihood under an assumed Gaussian likelihood)? Furthermore, given the outlier regions, an L1 loss may be more appropriate (LightGBM supports this directly, e.g., objective="regression_l1"), though recomputing everything might be excessive, so I am not suggesting it here, only for future studies.

• Perhaps just my issue, but "7 by 5 minutes" should be replaced by "7 by 5 arc minutes" the first time it is mentioned, with a bracketed explanation that 1 degree = 60 arc minutes. I had a brief moment of confusion trying to decipher this.

• SWH should be given in brackets next to the corresponding subtitle. When not reading a paper in order, searching for a specific abbreviation should lead directly to its definition.
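As promised in the feature-importance bullet above, a sketch of the quantitative FA reporting I have in mind (placeholder names; assumes the scikit-learn LightGBM wrapper, and that ranked_features holds the feature names in importance order):

    # (1) LightGBM's built-in gain importance, plotted as numbers, not ranks.
    import lightgbm as lgb
    import matplotlib.pyplot as plt

    model = lgb.LGBMRegressor().fit(X_train, y_train)
    lgb.plot_importance(model, importance_type="gain", max_num_features=27)
    plt.show()

    # (2) Ablation curve: cumulative R2 as features are added in rank order,
    #     which directly answers how much, e.g., the top 7 features achieve.
    from sklearn.metrics import r2_score

    r2_curve = []
    for k in range(1, len(ranked_features) + 1):
        cols = ranked_features[:k]
        m = lgb.LGBMRegressor().fit(X_train[cols], y_train)
        r2_curve.append(r2_score(y_test, m.predict(X_test[cols])))
    plt.plot(range(1, len(r2_curve) + 1), r2_curve, marker="o")
    plt.xlabel("number of top-ranked features")
    plt.ylabel("R2")
    plt.show()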