====================== Reviews and responses: ====================== RE: MS#: AA-D-19-01553 "The Performance of an Artificial Neural Network Model in Predicting the Early Distribution Kinetics of Propofol in Morbidly Obese and Lean Subjects" Dear Dr. Ingrande: Thank you for submitting your manuscript "The Performance of an Artificial Neural Network Model in Predicting the Early Distribution Kinetics of Propofol in Morbidly Obese and Lean Subjects" to Anesthesia & Analgesia for consideration. Your manuscript has been reviewed by our editorial board and outside experts. Based on their reviews and my own reading of the manuscript, your article is not acceptable for publication in Anesthesia & Analgesia in its present form, but I would be happy to receive a revised version. Please see my comments below: Executive Section Editor Comments to the Author: This manuscript was an interesting but challenging read. It covers a lot of material and complex concepts. I anticipate it will require a substantial revision to be suitable for publication. I have made a few comments to improve clarity for your consideration. Reviewers and the statistical team have also provided extensive and useful comments. A. For added clarity, consider adding: 1. A concise hypothesis in the context of primary outcome measure(s). The hypothesis is first described in the discussion (page 21, line 41). Consider adding it to the introduction. Thank you for this suggestion. We have changed the introduction to the following (P7 L29): “The objective of this study is to compare the performance of a compartmental model, recirculatory model and an ANN to describe propofol pharmacokinetics from a frequently sampled dataset. We hypothesize that the ANN will have better performance because of its ability to model complex non-linear systems without assuming a particular structure.” 2. A sample size plan. 
Explain what would be an appropriate sample size to build and then validate the ANN, recirculatory, and compartment models. If your sample size was inadequate, how should readers interpret your findings? As preliminary, with more data needed before the model can be used in a clinical setting? We did not perform an a priori power analysis as this prospective PK study was exploratory in nature, and we could not predict the number of parameters contained in the final models. Our sample size (including number of subjects and observations per subject) is larger than most prospectively collected PK datasets and in line with similar studies. We do believe that our sample size was appropriate to build and validate our recirculatory and compartmental models. The number of observations was sufficient to build and validate the ANN. ANNs have been shown to perform very well when given extremely large amounts of data, amounts that are generally not feasible to collect during a prospective PK study. We do believe that the performance of the ANN might have eclipsed that of the mixed-effects models had a larger dataset been available. The limitations of the small dataset (small for purposes of using an ANN) were detailed within our discussion (P22-23) and discussed below as a reason why ensembling was used (avoidance of over-fitting). However, results of this study can be considered hypothesis-generating, i.e., can ANNs offer better performance in modeling larger (pooled) PK datasets (P23 L42). 3. How you arrived at a sample size of 20 obese and 10 lean patients. Did 24 subjects and 1140 (or is it 1128? check for consistency) observations provide enough data? Thank you. We did not perform an a priori power analysis, as this is uncommon in PK studies since they are, in general, exploratory in nature (please see above response). However, the number of subjects and number of observations is large and in line with prior PK studies of propofol (Masui et al.
Early phase pharmacokinetics but not pharmacodynamics are influenced by propofol infusion rate. Anesthesiology 2009 Oct;111(4):805-17; Knibbe et al. Population pharmacokinetic and pharmacodynamic modeling of propofol for long-term sedation in critically ill patients: a comparison between propofol 6% and propofol 1%. Clin Pharmacol Ther. 2002 Dec;72(6):670-84). Based on these prior studies, our group determined that 30 subjects (47 samples each, for a total of 1410) is considered data-rich for an exploratory PK study that aims to construct a model and evaluate covariates. We apologize for the ambiguity regarding the number of observations in this study. In the results section (P17 LM9-16) we report that 30 subjects were enrolled, each contributing 47 samples for a total of 1410 samples. We do report that 6 subjects were excluded, meaning we analyzed 24 subjects (47 samples each) for a total of 1128 observations (as reported on P22 LM 54). 4. More clarity regarding what data were used to train the ANN and what data were used to internally validate your model. Thank you. Please see the comments for Reviewer #4 regarding this. I have summarized them here: The cross-validation setup consisted of 22 data points and 11 folds. During each fold, 18 points were used for training, 2 for validation, and 2 for testing. In this way each data point appears in a test fold exactly once, such that we can accumulate a test error that accounts for all of the data points. During each fold, the validation set was used to select the best model (i.e., training was terminated once the validation error did not improve further). 5. More clinical emphasis on compartment model misspecifications. Beyond a numerical assessment of model performance, what clinical metrics have been used to describe how poorly existing compartment models perform when used in patient care? With that in mind, what criteria would be considered a clinically useful improvement (beyond statistically better)?
Did the new ANN model meet those criteria? We agree that translation of our results (and of any PK model) is ultimately necessary to evaluate its clinical performance. This study did not prospectively validate the models to assess their true clinical performance or provide an assessment of how their misspecification may affect clinical administration of propofol. Anesthesiologists are quite good at administering propofol safely and effectively, and currently available TCI pumps are likewise capable of safe administration. This study was not designed to establish clinical superiority of one model over another. However, we can and have concluded that the performance of neural networks in modeling prospectively collected pharmacokinetic data is limited by the size of the data. This is a key limitation in the clinical utility of such models and is outlined as such in our discussion (P22 LM 33-54). We have added to the discussion (P23 paragraph 2) the following, which emphasizes that our models have not been clinically validated for performance: “None of our models have been prospectively validated to assess their clinical performance. However, we performed simulations of a standard induction dose of propofol (2 mg/kg given over 10 seconds), and compared these to a clinically validated model (Supplementary Figure 5). Concentration-time profiles between the four-compartment and GRU models as well as the model published by Schnider et al. were similar. All three models demonstrated peak concentrations that were higher than the recirculatory model.” 6. Simulations of clinical scenarios of model-driven propofol administration to illustrate model performance differences between the three models studied. We thank the reviewer for the excellent suggestion of performing simulations for evaluation of each model's performance and for comparison against previously published propofol PK models. Please see above response. B.
The methods mention measuring cardiac output using noninvasive methods. Was this data used as a covariate in any of the model building? (It is not mentioned in the results.) If not, consider removing. Yes, cardiac output was measured noninvasively in all subjects and was tested as a potential covariate in our models. We have added to the results (P17) that this covariate was tested during model building. C. TBW & LBW were used. Were other weight scalars considered? Why/why not? TBW and LBW were the only weight scalars used in this study. TBW and LBW provide objective, quantifiable measurements of subjects’ physiologic body composition. Our prior study showed that LBW is an optimal dosing scalar for administering an induction dose of propofol (Ingrande J, Brodsky JB, Lemmens HJ. Lean body weight scalar for the anesthetic induction dose of propofol in morbidly obese subjects. Anesth Analg 2011 Jul;113(1):57-62). In this study, propofol dose per kilogram LBW in morbidly obese patients was similar to dose per kilogram TBW in lean patients. Propofol was given to lean patients based upon TBW because this is the standard dosing scalar in these patients and, by definition, TBW and LBW are closely approximated in lean patients. D. Pointing out that you are the first to publish an idea may be construed by readers as not that compelling. Consider removing such statements and let the innovation speak for itself. We apologize for this type of statement and agree with the reviewer. We have removed the statement “Our study is the first to compare the performance of an ANN to conventional MEM’s in characterizing this type of data,” from the manuscript (P23 L31-34). Statistical Editor Comments: This manuscript was reviewed by a statistical reviewer (#4) who makes very important points about the statistical methods and reporting. Authors should follow these recommendations and also show in detail how they have addressed each point.
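As an aside to the LBW discussion in section C above: one common way such a scalar is computed is the Janmahasatian lean body weight equation. The manuscript excerpts here do not state which LBW formula the authors used, so the sketch below is illustrative only, not the study's method.

```python
def lean_body_weight(tbw_kg, height_m, male):
    """Janmahasatian (2005) LBW estimate. Illustrative only: the
    manuscript does not specify which LBW formula the authors used."""
    bmi = tbw_kg / height_m ** 2
    if male:
        return 9270 * tbw_kg / (6680 + 216 * bmi)
    return 9270 * tbw_kg / (8780 + 244 * bmi)

# A 150 kg, 1.70 m morbidly obese male (BMI ~52) has an LBW near 78 kg,
# so an LBW-based induction dose is roughly half of a TBW-based one --
# consistent with the authors' point that dose per kg LBW in the obese
# resembled dose per kg TBW in lean subjects.
```

Because LBW grows much more slowly than TBW in the morbidly obese, dosing on LBW avoids the overdose that scaling an induction bolus to TBW would produce.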
If you can address point-by-point my comments and those of the reviewers (see below), I will be happy to receive a revised version of the manuscript. However, I cannot promise that your revised version of the manuscript would achieve the priority necessary for publication in Anesthesia & Analgesia. If we do not receive a revised manuscript from you within 8 weeks, or a letter indicating your intention to send a revised manuscript, I will assume that you have elected to decline to revise your manuscript. If you choose to revise your manuscript, please submit your revision via Editorial Manager by logging in to your author account and clicking the link "Submissions Needing Revision." Be sure you have pasted your response to the reviewers into the appropriate box on the online submission site. With all good wishes, Ken B. Johnson, MD Executive Section Editor Anesthesia & Analgesia --- Jean-Francois Pittet, MD Editor-in-Chief Anesthesia & Analgesia ******************************************************* Reviewer Comments to the Author: Reviewer #1: The authors compared the initial pharmacokinetics of a propofol infusion using 3 different pharmacokinetic models: 1. 4-compartment, 2. Recirculatory, and 3. An AI model using long short-term memory and a gated recurrent unit. Blood was drawn from 3 subjects using a closed-loop system frequently for up to 16 hours. The AI model and the recirculatory model outperformed the 4-compartment model. Major concerns: In the abstract and throughout the manuscript the authors refer to high-resolution sampling and data set. I believe that they are most likely suggesting a high sampling rate of the blood samples and not the resolution with which you are performing your analysis. Based on the Nyquist Sampling Theorem: A band-limited continuous-time signal can be sampled and perfectly reconstructed from its samples if the waveform is sampled at more than twice its highest frequency component.
Is this true of the rate of sampling? What variable that is important in this waveform has the highest frequency? Maybe heart rate (stroke volume). I think it would be better to drop the high-resolution and refer to data as having a frequent sampling rate. Thank you for this comment. We agree with the reviewer that based on the Nyquist Sampling Theorem, calling our sampling scheme “high-resolution” is a poor descriptor. We have dropped “high-resolution” in favor of “frequent sampling rate” per the reviewer’s suggestion. Abstract pg4, ln 59 - “slightly” has no quantitative meaning; please provide an estimate of bias or another metric that demonstrates over-prediction. We thank the reviewer for this suggestion. We have removed the word “slightly”. Pursuant to the point below, we have reported mean bias error and root mean square errors for the four models, for quantitative measurements of bias and accuracy. We have also reported the confidence intervals of these measurements per the reviewer’s request (see response below). Pg5, ln 8; I don't see any estimate of under-prediction bias in the manuscript. Provide a quantitative estimate of the under-prediction and the confidence intervals of the bias estimate. Thank you. Inspection of figures 2 and 4 demonstrates the under-prediction bias seen in the four-compartment model after the first 5 minutes. We agree that the overall mean bias is positive. Per the reviewer recommendations, we have included the confidence intervals of the bias, which demonstrate, quantitatively, that this model suffers from both negative and positive bias. We have changed P5, LM 8 to read: “which suffered from over-prediction bias during the first 5 minutes followed by under-prediction bias after 5 minutes.” Pg 6 ln-33 - Low resolution refers to something not sharply defined. Again I think this is in reference to sampling frequency. Also, once you exceed the Nyquist sampling rate, you rarely gain much in terms of reconstruction of the signal.
How do you know you did not over-sample? Thank you. We have dropped “low resolution” in favor of “infrequent blood sampling” (P6 L33). Regarding the question of over-sampling, this study was designed to sample frequently enough to capture the peak propofol concentration and the fast decline in concentration immediately after the slow bolus administration of the drug (Fisher D. Almost everything you learned about pharmacokinetics was somewhat wrong. Anesth Analg. 1996 Nov;83(5):901-3). Our sampling strategy has been modeled after previously published studies (Masui et al. Early phase pharmacokinetics but not pharmacodynamics are influenced by propofol infusion rate. Anesthesiology 2009 Oct;111(4):805-17). Pg 9; ln 22: Please define obese. We apologize for this omission. We have provided body mass index criteria for the morbidly obese and lean groups. P9 L22 reads: “Thirty subjects were enrolled (20 morbidly obese, body mass index ≥40; 10 lean, body mass index <25).” Pg 12: Equation 2: I believe the eij term is missing in the equation. Thank you. This has been added to equation 2. My primary problem with the manuscript is that the estimates of bias and goodness of fit chosen, the MSE and the MPE, are not what I would consider the standard measures of these constructs. I believe that the mean bias deviation or the mean bias error would be a more appropriate estimate of the model bias, and that confidence intervals could be constructed so that a statistical estimate of the difference from zero could be made and presented as a quantitative estimate of the bias, rather than just visualization as is currently reported. Likewise, I believe the investigators should report the RMSE as a measure of goodness of fit and accuracy.
We would kindly ask the reviewer to look back at equation 3 in our manuscript (P13; LM4) to see that mean prediction error (as reported in our study) is calculated identically to mean bias error (which the reviewer has requested). Please refer to the equation for mean bias error as described by T. Raventos-Duran et al. Structure-activity relationships to estimate the effective Henry's law constants of organics of atmospheric interest. Atmos Chem Phys. 2010;10:7643-7654. In addition, the reviewer is asking for the RMSE, which is simply the square root of our reported MSE. MPE and MSE are indeed standard measures of bias and accuracy, respectively. Both of these metrics have been used in a previously published study comparing mixed effects models and neural networks (Chow HH, Tolle KM, Roe DJ, Elsberry V, Chen H. Application of neural networks to population pharmacokinetic data analysis. J Pharm Sci 1997 Jul;86(7):840-5). We have added a supplementary table including the MPE, MSE, objective functions (for the mixed-effects models), and the confidence intervals of the bias and precision per the reviewer’s request. Please provide 95% confidence intervals for the smoothed lines in figures 2 and 3. Figures 2 and 3 provide a lowess smoother of the direction of the bias. By definition, the smoother is not a model fit, hence there can be no calculation of confidence intervals from these plots. These plots were included because they provide a graphical representation of general bias direction versus time and observed concentration. These plots have been published by our group in a prior study published in Anesthesia and Analgesia (Ingrande et al.
Pharmacokinetics of cefazolin and vancomycin in infants undergoing open-heart surgery with cardiopulmonary bypass. Anesth Analg. 2019 May;128(5):935-943). The Visual Predictive Checks provide a robust graphical measurement of the ability of the model to reproduce the variability in the observed data. These plots are presented in Supplementary Figures 3 and 4 and provide the prediction intervals of the variability that we believe the reviewer is asking for. The visual predictive checks demonstrate that only a small percentage of the data falls outside of the 5% and 95% confidence intervals. Reviewer #2: This experimental study was performed to derive and compare the performance of 3 types of models (a traditional mammillary compartment model, a recirculatory model, and a gated recurrent unit neural network) during early and late propofol pharmacokinetics in morbidly obese and lean subjects. The authors hypothesize that since artificial neural networks are devoid of structure, they offer advantages over the other approaches in modeling complex non-linear systems. PK data were obtained from 17 morbidly obese and 7 lean subjects receiving propofol for induction of anesthesia. Main results show that the final recirculatory model and the gated recurrent unit neural network had similar performance. Both models, however, tended to over-predict propofol concentrations during the induction and elimination periods. Both models showed superior performance compared to the four-compartment model, which showed under-prediction errors. The relatively small dataset of this PK study was a limitation in adequately training the neural network model. In my opinion this is an interesting study assessing the capabilities of neural network modeling in complex PK scenarios. However, the inclusion of obese and lean subjects adds additional complexity that needs to be better explained in their modeling analysis.
Major concern: 1) Previous propofol PK studies in obese and lean patients have consistently found that volumes and clearances increase with weight, and therefore size scaling most commonly needs to be incorporated in model parameters. As far as I can see the authors did not incorporate size covariates in their 4-compartment and recirculatory models. I think it will be important to better explain how size affected model parameters in these models. As I understand, lean body weight and total body weight (and other covariates) were incorporated in the neural network model. Please clarify. Thank you for this inquiry. We do agree with the reviewer that other previously published models do indeed incorporate size descriptors. We did explore the relationship between covariates and PK parameters in the 4-compartment and recirculatory models, including size covariates as the reviewer mentioned. To clarify this point, we have added the following statement to page 17 LM 29: “Covariates analyzed included TBW, LBW, age, cardiac output, and gender.” We changed P17 LM53 to read: “Analysis of the relationships between model parameters and measured covariates (TBW, LBW, age, cardiac output and gender) revealed a positive linear relationship between V4 and V5 with age greater than 65 years.” None of these covariates had a significant relationship with model parameters or improved model fits when incorporated into the models in a forward-selection manner. We therefore left the covariates out of the model and presented the simplest model as our final model. To provide better clarification that the covariates were not included in the accepted final model we added (P18 L6): “The combined error recirculatory model (without covariates) was therefore accepted as the final model.” 2) I could not find parameter estimates for the 4-compartment final model. We apologize for this omission. We have now incorporated these estimates in Table 1 per the reviewer’s suggestion.
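As an illustration of the forward-selection screening described in this response, a covariate is typically retained only if its inclusion improves the NONMEM objective function (OFV, -2 log likelihood) by more than 3.84 points, the chi-square criterion for one added parameter at p < 0.05. The sketch below is a single-pass simplification, and every OFV number in it is invented for illustration; these are not the study's values.

```python
# Hypothetical single-pass sketch of forward covariate selection.
# A covariate is kept only if adding it drops the objective function
# by more than 3.84 (chi-square, p < 0.05, one added parameter).

BASE_OFV = 1250.0          # invented OFV of the covariate-free model
DELTA_CRITERION = 3.84

# Invented OFVs of the model refit with each single covariate added.
fit_ofv = {"TBW": 1248.9, "LBW": 1249.5, "age": 1247.0,
           "cardiac_output": 1249.9, "gender": 1250.0}

def forward_select(base_ofv, candidate_ofv, criterion=DELTA_CRITERION):
    """Return covariates whose inclusion improves OFV by > criterion."""
    return [cov for cov, ofv in candidate_ofv.items()
            if base_ofv - ofv > criterion]

selected = forward_select(BASE_OFV, fit_ofv)
# With these invented numbers no covariate meets the criterion,
# mirroring the authors' decision to present the covariate-free model.
```

In a full stepwise procedure the best-scoring covariate would be added, the model refit, and the search repeated; the one-pass version above only shows the acceptance test itself.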
Minor concern: Please clearly state in predicted vs observed concentration diagnostic plots whether they are population or post hoc estimates. Thank you. We have changed the titles of Figure legends 2 and 3 to clarify that these are population estimates. We have changed the title of Figure 2 to read: “Propofol Observed/Population Predicted Ratio vs. Time” Reviewer #3: Ingrande et al submit an original research article detailing the performance of an artificial neural network model in predicting the early distribution kinetics of propofol in morbidly obese and lean subjects. The authors found similar performance between a recirculatory model and a GRU neural network. The authors observed plasma concentrations of propofol up to 16 hours following induction via infusion. Given the short duration of action of propofol, the concentrations and ability of the models to predict plasma concentrations are likely not clinically relevant. Overestimation of propofol dosing is unlikely to lead to harm, and most clinicians (as evidenced by the authors' own methods) titrate medications for induction to effect. Given the amount of data on the pharmacokinetics and dynamics of propofol, I am unsure what the use of another model would provide clinically, although I acknowledge that it may yield some knowledge regarding the PK/PD modeling of other drugs, which may have truly been the authors' primary goal. Additionally, it is unclear to me why TBW was used in the lean group instead of LBW, as these weights are different. This study is not meant to be a “me too” study describing another PK model of propofol. Neural networks have shown promise and have won praise for their ability to model complex and large datasets and have been proposed as a possible replacement for conventional mixed effects models (Gambus P, Shafer SL. Artificial Intelligence for Everyone. Anesthesiology 2018;128:431-433). What is unknown is how well neural networks perform when modeling prospective PK data.
This study was designed to evaluate the performance of neural networks and compare their performance to conventional mixed effects models in modeling a large, frequently sampled pharmacokinetic dataset. Results of our study demonstrate the limitations of neural networks when analyzing smaller datasets. We chose to administer propofol to obese patients (BMI >40) based on LBW because our prior study found that LBW is an optimal dosing scalar for this drug (Ingrande J, Brodsky JB, Lemmens HJ. Lean body weight scalar for the anesthetic induction dose of propofol in morbidly obese subjects. Anesth Analg 2011 Jul;113(1):57-62). By definition, TBW and LBW are closely approximated in lean patients. In our prior study, propofol dose per kilogram LBW in morbidly obese patients was similar to dose per kilogram TBW in lean patients. Specific issues to be addressed by the authors: Abstract: P4L19- I believe you mean to say "in that they are devoid" Thank you for this correction. Per reviewer #4's comments, we have changed this statement to read: “in that they are flexible and not limited to a specific structure and therefore may be superior in modeling complex non-linear systems.” Introduction: It might be useful to describe what specific clinical problem or question your study attempts to answer. Given the overwhelming pharmacodynamic and pharmacokinetic data on propofol, what is the purpose of these improvements? We appreciate the reviewer's suggestion. We agree with the reviewer that there are numerous PK models of propofol. The primary objective of this study was to evaluate the performance of a neural network in modeling this frequently sampled dataset and to compare its performance to mixed models. We have added to the introduction the following (P7 L29): “The objective of this study is to compare the performance of a compartmental model, recirculatory model and an ANN to describe propofol pharmacokinetics from a frequently sampled dataset.
We hypothesize that the ANN will have better performance because of its ability to model complex non-linear systems without assuming a particular structure.” P7L35- Please conclude the introduction with a statement of specific primary and any secondary outcomes as well as your study hypothesis. We thank the reviewer for this excellent suggestion. Per the response above, we have concluded the introduction with our hypothesis, which reads as follows (P7 L29): “The objective of this study is to compare the performance of a mammillary compartmental model, recirculatory model and an ANN to describe propofol pharmacokinetics from a frequently sampled dataset. We hypothesized that the ANN would have better performance because of its ability to model complex non-linear systems without assuming a particular structure.” Methods: P8L21- What was considered obese? What was considered lean? Please specify in the methods. We apologize for this omission. We have included a definition of obese and lean subjects. P8L21 reads: “Thirty subjects were enrolled (20 morbidly obese, body mass index ≥40; 10 lean, body mass index <25).” P8L24- What surgeries were the participants undergoing? Were these all approximately the same duration? Specifically, were there any surgical factors that might have affected drug metabolism or excretion? We have added to the methods (P8 L24): “Obese patients underwent elective laparoscopic sleeve gastrectomies, gastric bandings, and Roux-en-Y gastric bypasses. Lean patients underwent a variety of elective general, plastic, gynecologic, and ENT cases.” The majority of cases were of similar duration (approx. 2-3 hours) and we do not believe there were any surgical factors (blood loss, fluid shifts, reductions in liver or kidney blood flow) to affect drug metabolism or excretion. Patients were excluded from the study if they had any history or evidence of hepatic or renal disease.
P8L55- Was arterial line placement considered standard of care in the care of these patients? The placement of the arterial line was specific to this study protocol. We have clarified by adding the statement “Per study protocol” to the manuscript. P9L26- As calculated LBW and TBW differed in the lean group, what was the rationale in using TBW in this group as opposed to using LBW for both groups? Please explain. Our prior study demonstrated that LBW is an appropriate weight-based scalar for morbidly obese subjects and that obese subjects given propofol based upon LBW required similar amounts of drug compared to lean subjects given propofol based upon TBW. Please also refer to our above response and to our response to the section editor. P11L17- Was an a priori power analysis performed? If so, please specify or state how the sample size was determined. Thank you for this inquiry. We did not perform an a priori power analysis. Power analyses are uncommon for PK studies, as they are, in general, exploratory in nature. This study was similarly exploratory in nature, and we were therefore unable to predict a priori how many parameters would be contained in our final model(s). The number of subjects and number of observations per subject are in line with a prior PK study which also employed frequent blood sampling (Masui et al. Early phase pharmacokinetics but not pharmacodynamics are influenced by propofol infusion rate. Anesthesiology 2009 Oct;111(4):805-17) and other previously published PK models of propofol where the data were prospectively gathered (Knibbe et al. Population pharmacokinetic and pharmacodynamic modeling of propofol for long-term sedation in critically ill patients: a comparison between propofol 6% and propofol 1%. Clin Pharmacol Ther. 2002 Dec;72(6):670-84). Results: P17L28- Please define these abbreviations when first used. We apologize for this omission.
We have defined the abbreviations in question and have included the statement on P17 L28: “Objective function (OBJ), mean prediction error (MPE) and mean square error (MSE) are reported (Supplementary Table 1).” Discussion: P20L9- Please begin the discussion with the primary results of this study. Thank you for this suggestion. We have rearranged the discussion and it now begins with the primary results of this study. P20L9 reads as follows: “Artificial neural networks (ANNs) have been praised for their ability to model complex non-linear data and have been proposed as a possible replacement for mixed effects models.8 This study aimed to evaluate the performance of a conventional mammillary compartmental model, a recirculatory model and an ANN in characterizing the PK of propofol using a frequently sampled, prospectively collected dataset. A GRU model had comparable performance to the recirculatory model, with both having better performance compared to the four-compartment model.” Reviewer #4: It's not clear that the authors' approach draws a fair comparison of the PK methods versus the ANN in predicting drug concentration. The former models were fitted using traditional criteria (-LL) whereas the latter were fitted by optimizing a cross-validated estimate of predictive performance, which would seem to advantage the ANN method relative to the PK method with regard to predictive performance. It is accurate that using a careful optimization approach confers advantages over traditional approaches. However, part of the goal of this paper is simply to demonstrate the advantage of using "modern" machine learning techniques, which are capable of directly optimizing the error metrics. We show how these techniques can be applied to predict drug concentration, following current practice in machine learning.
Our models and experiments are also intended to give a sense of how much training data is required to confer an advantage, and to demonstrate how we deal with issues of small datasets (which require careful cross-validation approaches, among other things). In addition, the ensemble approach confounds the ability to directly compare the approaches, since the ensemble approach uses input from the PK models. It seems it would thus be impossible to do worse with the ensemble than with the PK model. Such an ensemble method is guaranteed to do better than the PK model only at training time. At test time the model is susceptible to overfitting, and it is not guaranteed to outperform the PK model. Ensembling was necessary to avoid overfitting by the ANN. The ANN model is highly flexible and quickly overfits to the small datasets used here. The use of the PK method acts more like a form of regularization -- it ensures that the ANN model starts from a reasonably good solution, and does not get stuck in weak local optima (or local optima that are strong on the training set but weak on the test set). This was necessary given the small datasets we work with. We expect this would not be the case if larger datasets were available and that no ensembling would be necessary. In general, I feel that the authors' descriptions and documentation of the ANN approaches in particular are likely insufficient. Providing some code (Keras/TensorFlow?) may help with this. A link to the code is provided: http://jmcauley.ucsd.edu/propofol_dec19.html. We have referenced this link on P18 L24: “TensorFlow code for the GRU model can be found at http://jmcauley.ucsd.edu/propofol_dec19.html.” P4L19: I feel that "devoid of structure" does not convey the authors' intention, which is that neural networks are much more flexible at modeling the relationship between input (drug administration) and output (drug concentration), relative to conventional parametric PK models.
I recommend the authors consider alternative wording here.

We apologize for this vague and incorrect wording. We have changed the wording (P4L19) to read: “in that they are flexible and not limited to a specific structure, and therefore may be superior in modeling complex non-linear systems.”

P4L54: I believe that the numbers listed for mean prediction error are not in the correct order. The authors write that the recirculatory model and GRU model both slightly over-predicted, but the signs of the mean prediction error are different. Please revise.

Thank you, and please excuse this error. The MPE for the recirculatory and GRU models should read 0.348 and 0.161, respectively.

P13L40: It's not clear here what goodness-of-fit plots were considered. Please clarify in the text.

Thank you. We have clarified this statement to read: “Goodness-of-fit plots, including the ratio of observed to predicted concentrations vs. time, population predicted vs. observed concentrations, and overall model fit, were performed for model evaluation.”

P13L49: What was the purpose of these plots? And how was inestimability of THETA determined? Also, please clarify the meaning of THETA here, as some readers may not be familiar with NONMEM and the naming convention used there.

Log-likelihood plots are a graphical diagnostic of model overfitting. Briefly, the parameter being tested is changed in small increments (iterations) while the other parameters of the model are held fixed. The plots ensure that (1) the model can converge around a minimum value (-2LL), viewed graphically as a parabola-like shape, and (2) the convergence is ideally near the original parameter estimate. We agree with the reviewer that the meaning of THETA should be defined.
To define the meaning of THETA, we have changed the sentence to read: “After identification of the final model, log-likelihood plots were performed for each model parameter estimate (THETA).”

P13L49: Some justification of the sample size is required by journal policy.

We thank the reviewer for this inquiry. Please see responses #2-3 to the Section Editor and response #9 for Reviewer #3. In summary, we did not perform an a priori power analysis, as this was a prospective, exploratory PK analysis. Power analyses are uncommon in PK studies, and we remind the reviewer that we recently published a PK study in this journal without reporting a power analysis (Ingrande et al. Pharmacokinetics of cefazolin and vancomycin in infants undergoing open heart surgery with cardiopulmonary bypass. Anesth Analg. 2019 May;128(5):935-943). The sample size of this study is in line with previously published studies analyzing the PK of propofol (Masui et al. Early phase pharmacokinetics but not pharmacodynamics are influenced by propofol infusion rate. Anesthesiology. 2009 Oct;111(4):805-17; Knibbe et al. Population pharmacokinetic and pharmacodynamic modeling of propofol for long-term sedation in critically ill patients: a comparison between propofol 6% and propofol 1%. Clin Pharmacol Ther. 2002 Dec;72(6):670-84).

P13L58: The bootstrap internal validation method needs additional description in the text. Did the authors use a bootstrap that considers the correlation among repeated samples? What metrics were validated? Was this done by evaluating each new fitted model (x 1000) on the original data? Were visual predictive checks performed for each of the 1000 fits? Was there a mechanism to correct for overfitting of the PK models?

Thank you. We have expanded this section (P13L49 to P14L9) to provide more detail on the bootstrapping, VPC, and log-likelihood plot methods.
It now reads: “After identification of the final model, log-likelihood plots were performed for each model parameter estimate (THETA). If a parameter included in the model could not be reliably estimated from the data, the model was assumed to be overfit and was rejected. Bootstrap analysis of the final model was performed for internal validation. The bootstrap was performed by first creating 1000 new datasets of the same length as the original dataset by resampling at random from the observed dataset. The final model was fit to each dataset, and the distribution of the parameter estimates (THETAs) was examined, ultimately providing the mean, median, and percentiles of each parameter estimate in the model. Prediction-corrected visual predictive checks were performed to graphically assess whether simulations from each model could reliably predict the central trend (50th percentile) and variability (5th and 95th percentiles) in the observed data. Each model was used to create 1000 new datasets, each containing simulated propofol concentrations. The 5th, 50th, and 95th percentiles of the simulated data were compared to the same percentiles of the actual observed data.”

P14L47: "Before training, the hyperparameters (number of layers, number of nodes in hidden layer) were optimized via grid search." It's not clear how optimization can take place prior to training. Please clarify in the text here. The description of the overall ANN approach is not clear here. Were separate ANNs fit for each subject, or combined (naively?)? Why was the fold size selected to be 2? What metric was used for validation (mean square error)?

We apologize for any lack of clarity. We have clarified these methods (P14 L47) per the reviewer's request as follows: “The cross-validation setup consisted of 22 data points and 11 folds. During each fold, 18 points were used for training, 2 for validation, and 2 for testing.
In this way each data point appears in a test fold exactly once, so that we can accumulate a test error that accounts for all of the data points. During each fold, the validation set was used to select the best model (i.e., training was terminated once the validation error no longer improved).”

The ANN model can optimize either the MSE or the relative error (with some difficulty for the latter, as the derivative can be somewhat ill-conditioned). When reporting a specific error, we use an ANN model optimized for that error and compute that error on the validation set to select the best model.

P15L24: Please clarify how propofol dose differed from propofol administration rate. Was it cumulative?

Propofol dose was the absolute amount (in milligrams) of propofol given to the patient. The administration rate was the rate programmed into the infusion pump. We have clarified P15L24 to read: “For the ANN, the input layer consisted of 9 nodes for each time step (gender, LBW, TBW, age, total propofol dose administered (mg), rate of propofol administration (mg/min)), 2 hidden layers (10 nodes per layer), and 47 outputs, each node representing propofol concentration at a distinct time point.”

P15L42: 'absolute difference' - was the mean absolute difference used as the validation metric? If so, why use something other than mean square error?

The mean square error was used as the validation error metric. We apologize for this error; the word “absolute” has been omitted, and P15L42 has been changed to read: “defined as the mean square error (Equation 4).”

P18L35: "under-prediction" This seems contradictory with the reported positive mean prediction error for both of these models. Please explain.
We apologize for this error and agree with the reviewer that the mean prediction error is positive. Figure 2A (Panels B and D) clearly shows an observed-to-predicted ratio that is less than 1, indicating that the predictions are higher than the observed values. We have corrected P18L35 to read “over-prediction.”

P18L42: "over-predicted" It's not clear how to reconcile this with the previous sentences.

We thank the reviewer for pointing out this error. There is an over-prediction as mentioned (P18L42). We have corrected the previous statement in question (as above).
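For readers unfamiliar with the bootstrap internal validation described in our revised methods (1000 resampled datasets, refit the final model on each, summarize the parameter estimates), the procedure can be sketched as follows. This is an illustrative sketch only: `fit_model` is a stand-in for the actual NONMEM estimation step and simply returns a toy parameter estimate (the sample mean), and all names and values are hypothetical.

```python
import random
import statistics

def fit_model(dataset):
    """Stand-in for refitting the final PK model to one resampled
    dataset; here the 'parameter estimate' is just the sample mean."""
    return statistics.mean(dataset)

def bootstrap_summary(observed, n_resamples=1000, seed=0):
    """Create n_resamples datasets of the same length as the observed
    data by sampling with replacement, refit the model on each, and
    summarize the distribution of the parameter estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = [rng.choice(observed) for _ in observed]
        estimates.append(fit_model(resampled))
    estimates.sort()
    return {
        "mean": statistics.mean(estimates),
        "median": statistics.median(estimates),
        "p5": estimates[int(0.05 * n_resamples)],
        "p95": estimates[int(0.95 * n_resamples)],
    }

# Hypothetical observed concentrations:
summary = bootstrap_summary([2.1, 3.4, 2.8, 4.0, 3.1, 2.5])
```

The reported mean, median, and percentiles of each THETA in our bootstrap analysis follow this pattern, with the trivial `fit_model` above replaced by the full model estimation.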