======================
Reviews and responses:
======================
RE: MS#: AA-D-19-01553 "The Performance of an Artificial Neural Network Model in
Predicting the Early Distribution Kinetics of Propofol in Morbidly Obese and Lean
Subjects"
Dear Dr. Ingrande:
Thank you for submitting your manuscript "The Performance of an Artificial Neural
Network Model in Predicting the Early Distribution Kinetics of Propofol in Morbidly
Obese and Lean Subjects" to Anesthesia & Analgesia for consideration. Your
manuscript has been reviewed by our editorial board and outside experts. Based on
their reviews and my own reading of the manuscript, your article is not acceptable for
publication in Anesthesia & Analgesia in its present form, but I would be happy to
receive a revised version. Please see my comments below:
Executive Section Editor Comments to the Author:
This manuscript was an interesting but challenging read. It covers a lot of material and
complex concepts. I anticipate it will require a substantial revision to be suitable for
publication. I have made a few comments to improve clarity for your consideration.
Reviewers and the statistical team have also provided extensive and useful comments.
A. For added clarity, consider adding:
1. A concise hypothesis in the context of primary outcome measure(s). The hypothesis
is first described in the discussion (page 21, line 41). Consider adding it to the
introduction.
Thank you for this suggestion. We have changed the introduction to the following (P7
L29): “The objective of this study is to compare the performance of a compartmental
model, recirculatory model and an ANN to describe propofol pharmacokinetics from a
frequently sampled dataset. We hypothesize that the ANN will have better
performance because of its ability to model complex non-linear systems without
assuming a particular structure.”
2. A sample size plan. Explain what would be an appropriate sample size to build and
then validate the ANN, recirculatory, and compartment models. If your sample size was
inadequate, how should readers interpret your findings? As preliminary, with more data
needed before the model can be used in a clinical setting?
We did not perform an a priori power analysis as this prospective PK study was
exploratory in nature, and we could not predict the number of parameters contained in
the final models. Our sample size (including number of subjects and observations per
subject) is larger than most prospectively collected PK datasets and in line with similar
studies. We do believe that our sample size was appropriate to build and validate our
recirculatory and compartmental models.
The number of observations was sufficient to build and validate the ANN. ANNs have
been shown to perform very well when given extremely large amounts of data,
amounts that are generally not feasible to collect during a prospective PK study. We
do feel that the performance of the ANN may have eclipsed that of the mixed models if
given a larger data set. The limitations of the small dataset (small for purposes of
using an ANN) were detailed in our discussion (P22-23) and discussed below as a
reason why ensembling was used (avoidance of over-fitting). However, the results of
this study can be considered hypothesis generating, i.e., can ANNs offer better
performance in modeling larger (pooled) PK datasets? (P23 L42)
3. How you arrived at a sample size of 20 obese and 10 lean patients. Did 24 subjects
and 1140 (or is it 1128? check for consistency) observations provide enough data?
Thank you. We did not perform an a priori power analysis, as this is uncommon in PK
studies, which are, in general, exploratory in nature (please see above response).
However, the number of subjects and number of observations is large and in line with
prior PK studies of propofol (Masui et al. Early phase pharmacokinetics but not
pharmacodynamics are influenced by propofol infusion rate. Anesthesiology 2009
Oct;111(4):805-17; Knibbe et al. Population pharmacokinetic and pharmacodynamics
modeling of propofol for long-term sedation in critically ill patients: a comparison
between propofol 6% and propofol 1%. Clin Pharmacol Ther. 2002 Dec;72(6):670-84)
Based on these prior studies, our group determined that 30 subjects (47 samples per
subject, for a total of 1410) is considered data-rich for an exploratory PK study that aims to
construct a model and evaluate covariates.
We apologize for the ambiguity regarding the number of observations in this study. In
the results section (P17 LM9-16) we report that 30 subjects were enrolled, each
contributing 47 samples, for a total of 1410 samples. We do report that 6
subjects were excluded, meaning we analyzed 24 subjects (47 samples each) for a
total of 1128 observations (as reported on P22 LM 54).
4. More clarity to what data was used to train the ANN and what data was used to
internally validate your model.
Thank you. Please see our responses to Reviewer #4 regarding this. We have
summarized them here:
The cross-validation setup consisted of 22 data points and 11 folds. During each fold,
18 points were used for training, 2 for validation, and 2 for testing. In this way each
data point appears in a test fold exactly once, such that we can accumulate a test error
that accounts for all of the data points. During each fold, the validation set was used to
select the best model (i.e., training was terminated once the validation error did not
improve further).
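For concreteness, the fold construction described above can be sketched as follows. This is an illustrative reconstruction, not the code used in the study; in particular, the rotation used to pick the two validation points is an assumption, since only the fold sizes are specified above.

```python
def make_folds(n_points=22, n_folds=11):
    """Sketch of the cross-validation scheme described above: 22 data points,
    11 folds; per fold, 2 points for testing, 2 for validation, 18 for training.
    How validation points are chosen is an assumption (here: the next 2 points,
    wrapping around); the key property is that each point is tested exactly once."""
    idx = list(range(n_points))
    folds = []
    for k in range(n_folds):
        test = idx[2 * k:2 * k + 2]  # disjoint pairs: each point tested once
        val = [idx[(2 * k + 2) % n_points], idx[(2 * k + 3) % n_points]]
        train = [i for i in idx if i not in test and i not in val]
        folds.append((train, val, test))
    return folds
```

Accumulating the test-set errors over all 11 folds then yields an error estimate covering every data point, while each fold's validation set is used only to decide when to stop training.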
5. More clinical emphasis on compartment model misspecifications. Beyond a numerical
assessment of model performance, what clinical metrics have been used to describe
how poorly existing compartment models perform when used in patient care? With that
in mind, what criteria would be considered a clinically useful improvement (beyond
statistically better)? Did the new ANN model meet those criteria?
We agree that translation of our results (and of any PK model) is ultimately necessary
to evaluate its clinical performance. This study did not prospectively validate the
models to assess their true clinical performance or provide an assessment of how their
misspecification may affect clinical administration of propofol. Anesthesiologists are
quite good at administering propofol safely and effectively and currently available TCI
pumps are likewise capable of safe administration. This study was not designed to
establish clinical superiority of one model over another. However, we can and have
concluded that the performance of neural networks in modeling prospectively collected
pharmacokinetic data is limited by the size of the data. This is a key limitation in the
clinical utility of such models and outlined as such in our discussion (P22 LM 33-54).
We have added to the discussion (P23 paragraph 2) the following that emphasizes that
our models have not been clinically validated for performance: “None of our models
have been prospectively validated to assess their clinical performance. However, we
performed simulations of a standard induction dose of propofol (2 mg/kg given over 10
seconds), and compared these to a clinically validated model (Supplementary Figure
5). Concentration-time profiles between the four-compartment and GRU models as
well as the model published by Schnider et al. were similar. All three models
demonstrated peak concentrations that were higher than the recirculatory model.”
6. Simulations of clinical scenarios of model driven propofol administration to illustrate
model performance differences between the three models studied.
We thank the reviewer for the excellent suggestion of performing simulations to
evaluate each model's performance and to compare against previously published
propofol PK models. Please see above response.
B. The methods mention measuring cardiac output using noninvasive methods. Was
this data used as a covariate in any of the model building (it is not mentioned in the
results). If not, consider removing.
Yes, cardiac output was measured noninvasively in all subjects and was tested as a
potential covariate in our models. We have added to the results (P17) that this
covariate was tested during model building.
C. TBW & LBW were used. Were other weight scalars considered? Why/why not?
TBW and LBW were the only weight scalars used in this study. TBW and LBW provide
objective, quantifiable measurements of subjects’ physiologic body composition. Our
prior study showed that LBW is an optimal dosing scalar for administering an induction
dose of propofol (Ingrande J, Brodsky JB, Lemmens HJ. Lean body weight scalar for
the anesthetic induction dose of propofol in morbidly obese subjects. Anesth Analg
2011 Jul;113(1):57-62). In this study, propofol dose per kilogram LBW in morbidly
obese patients was similar to dose per kilogram TBW in lean patients. Propofol was
given to lean patients based upon TBW because this is the standard dosing scalar in
these patients and, by definition TBW and LBW are closely approximated in lean
patients.
D. Pointing out that you are the first to publish an idea may be construed by readers as
not that compelling. Consider removing such statements and let the innovation speak
for itself.
We apologize for this type of statement and agree with the reviewer. We have
removed the statement “Our study is the first to compare the performance of an ANN to
conventional MEMs in characterizing this type of data” from the manuscript (P23 L31-
34).
Statistical Editor Comments:
This manuscript was reviewed by a statistical reviewer (#4) who makes very important
points about the statistical methods and reporting.
Authors should follow these recommendations and also show in detail how they have
addressed each point.
If you can address point-by-point my comments and those of the reviewers (see
below), I will be happy to receive a revised version of the manuscript. However, I
cannot promise that your revised version of the manuscript would achieve the priority
necessary for publication in Anesthesia & Analgesia.
If we do not receive a revised manuscript from you within 8 weeks, or a letter from
you indicating your intention to send a revised manuscript, I will assume that you
have elected to decline to revise your manuscript.
If you choose to revise your manuscript, please submit your revision via Editorial
Manager by logging in to your author account and clicking the link "Submissions
Needing Revision." Be sure you have pasted your response to the reviewers into the
appropriate box on the online submission site.
With all good wishes,
Ken B. Johnson, MD
Executive Section Editor
Anesthesia & Analgesia
---
Jean-Francois Pittet, MD
Editor-in-Chief
Anesthesia & Analgesia
*******************************************************
Reviewer Comments to the Author:
Reviewer #1:
The authors compared the initial pharmacokinetics of a propofol infusion using 3
different pharmacokinetic models: 1. 4-compartment, 2. recirculatory, and 3. an AI
model using long short-term memory and a gated recurrent unit. Blood was drawn
frequently from 3 subjects using a closed-loop system for up to 16 hours. The AI model
and the recirculatory model outperformed the 4-compartment model.
Major concerns:
In the abstract and throughout the manuscript the authors refer to high-resolution
sampling and data set. I believe that they are most likely suggesting a high sampling
rate of the blood samples and not the resolution at which you are performing your
analysis. Based on the Nyquist Sampling Theorem, a band-limited continuous-time
signal can be sampled and perfectly reconstructed from its samples if the waveform is
sampled at over twice its highest-frequency component. Is this true of the rate
of sampling? What variable that is important in this waveform has the highest
frequency? Maybe heart rate (stroke volume). I think it would be better to drop
"high-resolution" and refer to the data as having a frequent sampling rate.
Thank you for this comment. We agree with the reviewer that based on the Nyquist
Sampling Theorem, calling our sampling scheme “high-resolution” is a poor descriptor.
We have dropped “high-resolution” in favor of “frequent sampling rate” per the
reviewer’s suggestion.
Abstract pg4, ln 59 - "slightly" has no quantitative meaning; please provide an estimate
of bias or another metric that demonstrates over-prediction.
We thank the reviewer for this suggestion. We have removed the word “slightly”.
Pursuant to the point below, we have reported mean bias error and root mean square
errors for the four models, for quantitative measurements of bias and accuracy. We
have also reported the confidence intervals of these measurements per the reviewer’s
request (see response below).
Pg5, ln 8; I don't see any estimate of under prediction bias in the manuscript. Provide a
quantitative estimate of the under-prediction and the confidence intervals of the bias
estimate.
Thank you. Inspection of figures 2 and 4 demonstrate the under-prediction bias seen
in the four-compartment model after the first 5 minutes. We agree that the overall
mean bias is positive. Per the reviewer recommendations, we have included the
confidence intervals of the bias, which demonstrate, quantitatively, that this model
suffers from both negative and positive bias. We have changed P5, LM 8 to read:
“which suffered from over-prediction bias during the first 5 minutes followed by under-
prediction bias after 5 minutes.”
Pg 6 ln-33 - "Low resolution" means not sharply defined. Again, I think this is a
reference to sampling frequency. Also, once you exceed the Nyquist sampling rate
you rarely gain much in terms of reconstruction of the signal. How do you know you did
not over-sample?
Thank you. We have dropped “low resolution” in favor of “infrequent blood sampling”
(P6 L33). Regarding the question of over-sampling, this study was designed to sample
frequently enough to capture the peak propofol concentration and fast decline in
concentration immediately after the slow bolus administration of the drug. (Fisher D.
Almost everything you learned about pharmacokinetics was somewhat wrong. Anesth
Analg. 1996 Nov;83(5):901-3). Our sampling strategy has been modeled after
previously published studies (Masui et al. Early phase pharmacokinetics but not
pharmacodynamics are influenced by propofol infusion rate. Anesthesiology 2009
Oct;111(4):805-17).
Pg 9; ln 22: Please define obese
We apologize for this omission. We have provided body mass index criteria for the
morbidly obese and lean groups. P9 L 22 reads: “Thirty subjects were enrolled (20
morbidly obese, body mass index ≥40; 10 lean, body mass index <25).”
Pg 12: Equation 2: I believe the eij term is missing in the equation.
Thank you. This has been added to equation 2.
My primary problem with the manuscript is that the chosen estimates of bias and
goodness of fit, the MSE and the MPE, are not what I would consider the standard
measures of these constructs. I believe that the mean bias deviation or the mean bias
error would be a more appropriate estimate of the model bias, and that confidence
intervals could be constructed so that a statistical estimate of the difference from zero
could be made and presented as a quantitative estimate of the bias, rather than just the
visualization as is currently reported. Likewise, I believe the investigators should report
the RMSE as a measure of accuracy.
We would kindly ask the reviewer to look back at equation 3 in our manuscript (P13;
LM4) to see that mean prediction error (as reported in our study) is the same
calculation as mean bias error (which the reviewer has requested).
The equation for mean bias error is identical to that of mean prediction error. Please
refer to the equation for mean bias error as described by Raventos-Duran et al.
Structure-activity relationships to estimate the effective Henry’s Law constants of
organics of atmospheric interest. Atmos Chem Phys 2010;10:7643-7654.
In addition, the reviewer is asking for the RMSE, which is simply the square root of our
reported MSE. MPE and MSE are indeed standard measures of bias and accuracy,
respectively. Both of these metrics have been used in a previously published study
comparing mixed effects models and neural networks (Chow HH, Tolle KM, Roe DJ,
Elsberry V, Chen H. Application of neural networks to population pharmacokinetic data
analysis. J Pharm Sci 1997 Jul;86(7):840-5).
We have added a supplementary table including the MPE, MSE, Objective Functions
(for the mixed-effects models) and the confidence intervals of the bias and precision
per the reviewer’s request.
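To make the equivalences above concrete, the following is a minimal sketch of the three metrics; the sign convention for the residual (predicted minus observed) is our assumption and does not affect the MPE/mean-bias-error or RMSE/MSE identities.

```python
import math

def mpe(observed, predicted):
    # Mean prediction error = mean bias error: the average residual.
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def mse(observed, predicted):
    # Mean squared error: the average squared residual.
    return sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    # The RMSE requested by the reviewer is just the square root of the reported MSE.
    return math.sqrt(mse(observed, predicted))
```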
Please provide 95% confidence intervals for the smoothed lines in figures 2 and 3.
Figures 2 and 3 provide a lowess smoother of the direction of the bias. By definition,
the smoother is not a model fit, hence there can be no calculations of confidence
intervals from these plots. These plots were included because they provide a graphical
representation of general bias direction versus time and observed concentration.
These plots have been published by our group in a prior study published in Anesthesia
and Analgesia (Ingrande et al. Pharmacokinetics of cefazolin and vancomycin in
infants undergoing open-heart surgery with cardiopulmonary bypass. Anesth Analg.
2019 May;128(5):935-943).
The Visual Predictive Checks provide a robust graphical measurement of the ability of
the model to reproduce the variability in the observed data. These plots are presented
in Supplementary Figures 3 and 4 and provide the prediction interval in the variability
that we believe the reviewer is asking for. The visual predictive checks demonstrate
that only a small percentage of the data falls outside of the 5% and 95% prediction
intervals.
Reviewer #2:
This experimental study was performed to derive and compare the performance of 3
types of models (traditional mammillary compartment model, a recirculatory model and
a gated recurrent unit neural network) during early and late propofol pharmacokinetics
in morbidly obese and lean subjects. The authors hypothesize that since artificial
neural networks are devoid of structure, they offer advantages over the other
approaches in modeling complex non-linear systems.
PK data was obtained from 17 morbidly obese and 7 lean subjects receiving propofol
for induction of anesthesia. Main results show that the final recirculatory model and the
gated recurrent unit neural network had similar performance. Both models, however,
tended to over-predict propofol concentrations during the induction and elimination
periods. Both models showed superior performance compared to the four-
compartment model which showed under-prediction errors. The relatively small dataset
of this PK study was a limitation to adequately train the neural network model.
In my opinion this is an interesting study assessing the capabilities of neural network
modeling in complex PK scenarios. However, the fact of adding obese and lean
subjects adds an additional complexity that needs to be better explained in their
modeling analysis.
Major concern:
1) Previous propofol PK studies in obese and lean patients have consistently found
that volumes and clearances increase with weight and therefore size scaling most
commonly needs to be incorporated in model parameters. As far as I can see the
authors did not incorporate size covariates in their 4-compartment and recirculatory
models. I think it will be important to better explain how size affected model parameters
in these models. As I understand lean body weight and total body weight (and other
covariates) were incorporated in the neural network model. Please clarify.
Thank you for this inquiry. We do agree with the reviewer that other previously
published models do indeed incorporate size descriptors. We did explore the
relationship between covariates and PK parameters in the 4-compartment and
recirculatory models including size covariates as the reviewer mentioned. To clarify
this point, we have added the following statement to page 17 LM 29: “Covariates
analyzed included TBW, LBW, age, cardiac output, and gender.” We changed P17
LM53 to read: “Analysis of the relationships between model parameters and measured
covariates (TBW, LBW, age, cardiac output and gender) revealed a positive linear
relationship between V4 and V5 with age greater than 65 years.”
None of these covariates had a significant relationship with model parameters or
improved model fits when incorporated into the models in a forward selection manner. We
therefore left the covariates out of the model and presented the simplest model as our
final model. To provide better clarification that the covariates were not included in the
accepted final model we added (P18L6): “The combined error recirculatory model
(without covariates) was therefore accepted as the final model.”
2) I could not find parameter estimates for the 4-compartment final model.
We apologize for this omission. We have now incorporated these estimates in Table 1
per the reviewer’s suggestion.
Minor concern
Please clearly state in predicted vs observed concentration diagnostic plots if they are
population or post hoc estimates.
Thank you. We have changed the title of Figure legends 2 and 3 to clarify that these
are population estimates. We have changed the title to Figure 2 to read: “Propofol
Observed/Population Predicted Ratio vs. Time”
Reviewer #3:
Ingrande et al submit an original research article detailing the performance of an
artificial neural network model in predicting the early distribution kinetics of propofol in
morbidly obese and lean subjects. The authors found similar performance between a
recirculatory model and GRU neural network. The authors observed plasma
concentrations of propofol up to 16 hours following induction via infusion. Given the
short duration of action of propofol, the concentrations and ability of the models to
predict plasma concentrations are likely not clinically relevant. Overestimation of
propofol dosing is unlikely to lead to harm and most clinicians (as evidenced by the
authors' own methods) titrate medications for induction to effect. Given the amount of
data on the pharmacokinetics and dynamics of propofol, I am unsure what the use of
another model would provide clinically, although I acknowledge that it may yield some
knowledge regarding the PK/PD modeling of other drugs, which may have truly been
the authors' primary goal. Additionally, it is unclear to me why TBW was used in the
lean group instead of LBW as these weights are different.
This study is not meant to be a “me too” study describing another PK model of
propofol. Neural networks have shown promise and have won praise for their ability to
model complex and large datasets and have been proposed as a possible replacement
for conventional mixed effects models (Gambus P, Shafer SL. Artificial Intelligence for
Everyone. Anesthesiology 2018;128:431-433). What is unknown is how well neural
networks perform when modeling prospective PK data. This study was designed to
evaluate the performance of neural networks and compare their performance to
conventional mixed effects models in modeling a large, frequently sampled
pharmacokinetic data set. Results of our study demonstrate the limitations of neural
networks when analyzing smaller datasets.
We chose to administer propofol to obese patients (BMI >40) based on LBW because
our prior study found that LBW is an optimal dosing scalar for this drug.
(Ingrande J, Brodsky JB, Lemmens HJ. Lean body weight scalar for the anesthetic
induction dose of propofol in morbidly obese subjects. Anesth Analg 2011
Jul;113(1):57-62). By definition, TBW and LBW are closely approximated in lean
patients. In our prior study, propofol dose per kilogram LBW in morbidly obese patients
was similar to dose per kilogram TBW in lean patients.
Specific issues to be addressed by the authors:
Abstract:
P4L19- I believe you mean to say "in that they are devoid"
Thank you for this correction. Per reviewer #4’s comments, we have changed this
statement to read: “in that they are flexible and not limited to a specific structure and
therefore may be superior in modeling complex non-linear systems.”
Introduction:
It might be useful to describe what specific clinical problem or question your study
attempts to answer. Given the overwhelming pharmacodynamic and pharmacokinetic
data on propofol, what is the purpose of these improvements?
We appreciate the reviewer’s suggestion. We agree with the reviewer that there are
numerous PK models of propofol. The primary objective of this study was to evaluate
the performance of a neural network in modeling this high frequency dataset and
compare the performance to mixed models. We have added to the introduction the
following (P7 L29): “The objective of this study is to compare the performance of a
compartmental model, recirculatory model and an ANN to describe propofol
pharmacokinetics from a frequently sampled dataset. We hypothesize that the ANN
will have better performance because of its ability to model complex non-linear
systems without assuming a particular structure.”
P7L35- Please conclude the introduction with a statement of specific primary and any
secondary outcomes as well as your study hypothesis.
We thank the reviewer for this excellent suggestion. Per the response above we have
concluded the introduction with our hypothesis to read as follows (P7 L29): “The
objective of this study is to compare the performance of a mammillary compartmental
model, recirculatory model and an ANN to describe propofol pharmacokinetics from a
frequently sampled dataset. We hypothesized that the ANN would have better
performance because of its ability to model complex non-linear systems without
assuming a particular structure.”
Methods:
P8L21- What was considered obese? What was considered lean? Please specify in
the methods.
We apologize for this omission. We have included a definition of obese and lean
subjects. P8L21 reads: “Thirty subjects were enrolled (20 morbidly obese, body mass
index ≥40; 10 lean, body mass index <25).”
P8L24- What surgeries were the participants undergoing? Were these all
approximately the same duration? Specifically, were there any surgical factors that
might have affected drug metabolism or excretion.
We have added to the methods (P8 L24): “Obese patients underwent elective
laparoscopic sleeve gastrectomies, gastric bandings and Roux-en-Y gastric bypasses.
Lean patients underwent a variety of elective general, plastic, gynecologic, and ENT
cases.”
The majority of cases were of similar duration (approx. 2-3 hours) and we do not
believe there were any surgical factors (blood loss, fluid shifts, reductions in liver or
kidney blood flow) that would affect drug metabolism or excretion. Patients were excluded
from the study if they had any history or evidence of hepatic or renal disease.
P8L55- Was arterial line placement considered standard of care in the care of these
patients?
The placement of the arterial line was specific to this study protocol. We have clarified
by adding the statement “Per study protocol” to the manuscript.
P9L26- As calculated LBW and TBW differed in the lean group, what was the rationale
for using TBW in this group as opposed to using LBW for both groups? Please explain.
Our prior study demonstrated that LBW is an appropriate weight based scalar for
morbidly obese subjects and that obese subjects given propofol based upon LBW
required similar amounts of drug compared to lean subjects given propofol based upon
TBW. Please also refer to our above response and to our response to the section
editor.
P11L17- Was an a priori power analysis performed? If so, please specify or state how
the sample size was determined.
Thank you for this inquiry. We did not perform an a priori power analysis. Power
analyses are uncommon for PK studies, as they are, in general, exploratory in nature.
This study was similarly exploratory in nature, and we were therefore unable to predict
a priori how many parameters would be contained in our final model(s).
The number of subjects and number of observations per subject are in line with a prior
PK study which also employed frequent blood sampling (Masui et al. Early phase
pharmacokinetics but not pharmacodynamics are influenced by propofol infusion rate.
Anesthesiology 2009 Oct;111(4):805-17) and other previously published PK models of
propofol where the data was prospectively gathered (Knibbe et al. Population
pharmacokinetic and pharmacodynamics modeling of propofol for long-term sedation
in critically ill patients: a comparison between propofol 6% and propofol 1%. Clin
Pharmacol Ther. 2002 Dec;72(6):670-84).
Results:
P17L28- Please define these abbreviations when first used.
We apologize for this omission. We have defined the abbreviations in question and
have included the statement on P17 L28: “Objective function (OBJ), mean prediction
error (MPE) and mean square error (MSE) are reported (Supplementary Table 1).”
Discussion:
P20L9- Please begin the discussion with the primary results of this study.
Thank you for this suggestion. We have rearranged the discussion and it now begins
with the primary results of this study. P20L9 reads as follows:
“Artificial neural networks (ANNs) have been praised for their ability to model complex
non-linear data and have been proposed as a possible replacement for mixed effects
models.8 This study aimed to evaluate the performance of a conventional mammillary
compartmental model, a recirculatory model and an ANN in characterizing the PK of
propofol using a frequently sampled prospectively collected dataset. A GRU model had
comparable performance to the recirculatory model, with both having better
performance compared to the four-compartment model.”
Reviewer #4:
It's not clear that the authors' approach draws a fair comparison of the PK methods
versus the ANN in predicting drug concentration. The former models were fitted using
traditional criteria (-LL) whereas the latter were fitted by optimizing a cross-validated
estimate of predictive performance, which would seem to advantage the ANN method
relative to the PK method, with regard to predictive performance.
It is accurate that using a careful optimization approach confers advantages over
traditional approaches. However, part of the goal of this paper is simply to demonstrate
the advantage of using "modern" machine learning techniques, which are capable of
directly optimizing the error metrics. We show how these techniques can be applied to
predict drug concentration, following current practice in machine learning. Our models
and experiments are also intended to give a sense of how much training data is
required to confer an advantage, and to demonstrate how we deal with issues of small
datasets (which require careful cross-validation approaches, among other things).
In addition, the ensemble approach confounds the ability to directly compare the
approaches, since the ensemble approach uses input from the PK models. It seems it
would thus be impossible to do worse with the ensemble than with the PK model.
Such an ensemble method is guaranteed to do better than the PK model only at
training time. At test time the model is susceptible to overfitting, and it is not
guaranteed to outperform the PK model.
Ensembling was necessary to avoid overfitting by the ANN. The ANN model is highly
flexible and quickly overfits the small datasets used here. The use of the PK model
acts as a form of regularization: it ensures that the ANN starts from a reasonably good
solution and does not get stuck in weak local optima (or in local optima that are strong
on the training set but weak on the test set). This was necessary given the small
datasets we work with. We expect that ensembling would not be necessary if larger
datasets were available.
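The training-time guarantee discussed above (and its absence at test time) can be illustrated with a minimal numpy sketch. The concentration values and the 0.5 shrinkage factor below are hypothetical, not the manuscript's actual ensemble; a residual-style combination stands in for the ensembling step.

```python
import numpy as np

# Hypothetical observed concentrations and PK-model predictions
# (illustrative values only, not data from the study).
obs = np.array([4.0, 3.1, 2.5, 2.0, 1.6, 1.3])
pk_pred = np.array([3.6, 3.0, 2.7, 2.1, 1.5, 1.4])

# Residual-style ensemble: the ANN component only needs to learn a
# correction to the PK prediction, so it starts from a good solution.
residual = obs - pk_pred                    # target the ANN would be fit to
ensemble_pred = pk_pred + 0.5 * residual    # partial correction (shrinkage)

def mse(pred):
    return float(np.mean((pred - obs) ** 2))

# On the data used for fitting, the ensemble cannot do worse than the
# PK model alone; no such guarantee holds on held-out test data.
print(mse(pk_pred), mse(ensemble_pred))
```

On the fitting data the ensemble error is bounded by the PK error, which is the reviewer's point; on a held-out test set the correction can overshoot, which is ours.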
In general, I feel that the authors' descriptions and documentation of the ANN
approaches in particular are likely insufficient. Providing some code
(Keras/TensorFlow?) may help with this.
A link to the code is provided: http://jmcauley.ucsd.edu/propofol_dec19.html. We
have referenced this link on P18 L24: “TensorFlow code for the GRU model can be
found at http://jmcauley.ucsd.edu/propofol_dec19.html.”
P4L19: I feel that "devoid of structure" does not convey the authors' intention, which is
that neural networks are much more flexible at modeling the relationship between input
(drug administration) and output (drug concentration), relative to conventional
parametric PK models. I recommend the authors consider alternative wording here.
We apologize for this vague and incorrect wording. We have changed this wording
(P4L19) to read “in that they are flexible and not limited to a specific structure and
therefore may be superior in modeling complex non-linear systems.”
P4L54: I believe that the numbers listed for mean prediction error are not in the correct
order. The authors write that the recirculatory model and GRU model both slightly
over-predicted, but the signs of the mean prediction error are different. Please revise.
Thank you and please excuse this error. The MPE for the recirculatory and GRU
models should read 0.348 and 0.161, respectively.
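Definitions of mean prediction error vary; assuming the simple form mean(predicted − observed), a positive value indicates over-prediction, consistent with the corrected signs above. The concentrations in this sketch are hypothetical.

```python
import numpy as np

# Hypothetical observed and predicted concentrations (illustrative only).
obs = np.array([3.2, 2.8, 2.4, 2.0])
pred = np.array([3.5, 3.0, 2.6, 2.1])

# Mean prediction error: positive when the model over-predicts.
mpe = float(np.mean(pred - obs))
print(mpe)
```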
P13L40: It's not clear here what goodness of fit plots were considered. Please clarify in
the text.
Thank you. We have clarified this statement to read: “Goodness-of-fit plots, including
the ratio of observed to predicted concentrations vs. time, population predicted vs.
observed concentrations, and overall model fit, were examined for model evaluation.”
P13L49: What was the purpose of these plots. And how was inestimability of THETA
determined? Also, please clarify the meaning of THETA here, as some readers may
not be familiar with NONMEM and the naming convention used there.
Log-likelihood profile plots are a graphical diagnostic of model overfitting. Briefly, the
parameter being tested is changed in small increments (iterations) while the other
parameters of the model are held fixed. The plots verify that (1) the model converges
to a minimum objective function value (-2LL), seen graphically as a parabola-like
shape, and (2) the minimum lies at or near the original parameter estimate.
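This profiling procedure can be sketched with a toy model: a normal distribution fit to synthetic data, with the mean profiled while the standard deviation is held fixed. The data and model form are assumptions for illustration, not the study's PK models.

```python
import numpy as np

# Toy illustration of a log-likelihood profile (synthetic data).
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=1.0, size=50)

def neg2_log_likelihood(mu, sigma, x):
    # -2 log-likelihood for a normal model.
    return float(np.sum(2 * np.log(sigma * np.sqrt(2 * np.pi))
                        + ((x - mu) / sigma) ** 2))

mu_hat = data.mean()          # original parameter estimate
sigma_hat = data.std()        # held fixed while mu is profiled
grid = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 41)
profile = [neg2_log_likelihood(mu, sigma_hat, data) for mu in grid]

# A well-behaved profile is parabola-like, with its -2LL minimum
# falling at or near the original estimate.
best = float(grid[int(np.argmin(profile))])
print(round(mu_hat, 3), round(best, 3))
```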
We agree with the reviewer that the meaning of THETA should be defined. To define
the meaning of THETA, we have changed the sentence to read: “After identification of
the final model, log-likelihood plots were performed for each model parameter estimate
(THETA).”
P13L49: Some justification of the sample size is required by journal policy.
We thank the reviewer for this inquiry. Please see responses #2-3 to the Section
Editor and response #9 for Reviewer #3. In summary, we did not perform an a priori
power analysis, as this was a prospective, exploratory PK analysis. Power analyses
are uncommon in PK studies, and we remind the reviewer that we recently published a
PK study in this journal without reporting a power analysis (Ingrande et al.
Pharmacokinetics of cefazolin and vancomycin in infants undergoing open heart
surgery with cardiopulmonary bypass. Anesth Analg. 2019 May;128(5):935-943). The
sample size of this study is in line with previously published studies analyzing the PK of
propofol (Masui et al. Early phase pharmacokinetics but not pharmacodynamics are
influenced by propofol infusion rate. Anesthesiology. 2009 Oct;111(4):805-17; Knibbe
et al. Population pharmacokinetic and pharmacodynamic modeling of propofol for
long-term sedation in critically ill patients: a comparison between propofol 6% and
propofol 1%. Clin Pharmacol Ther. 2002 Dec;72(6):670-84).
P13L58: The bootstrap internal validation method needs additional description in the
text. Did the authors use a bootstrap that considers the correlation among repeated
samples? What metrics were validated? Was this done by evaluating each new fitted
model (x 1000) on the original data? Were visual predictive checks performed for each
of the 1000 fits? Was there a mechanism to correct for overfitting of the PK models?
Thank you. We have expanded this section (P13L49 to P14L9) to provide more detail
on the bootstrapping, VPC, and log-likelihood plot methods. It now reads:
“After identification of the final model, log-likelihood plots were performed for each
model parameter estimate (THETA). If a parameter that was included in the model
could not be reliably estimated from the data, the model was assumed to be overfit
and was rejected.
Bootstrap analysis of the final model was performed for internal validation. The
bootstrap was performed by first creating 1000 new datasets of the same length as the
original dataset by resampling at random from the observed dataset. The final model
was fit to each dataset, and the distribution of the parameter estimates (THETAs) was
examined, ultimately providing the mean, median, and percentiles of each parameter
estimate in the model.
Prediction-corrected visual predictive checks were performed to graphically assess
whether simulations of each model can reliably predict the central trend (50th
percentile) and variability (5th and 95th percentiles) in the observed data. Each model
was used to create 1000 new datasets each containing simulated propofol
concentrations. The 5th, 50th, and 95th percentiles of the simulated data were
compared to the same percentiles of the actual observed data.”
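The resampling step described above can be sketched as follows. In the actual analysis the final NONMEM model is refit to each resampled dataset; here a simple sample mean stands in for the parameter estimate, and the subject-level values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# One value per subject; resampling whole subjects preserves the
# correlation among a subject's repeated samples. Synthetic data.
observed = rng.lognormal(mean=0.5, sigma=0.3, size=22)

n_boot = 1000
boot_estimates = np.empty(n_boot)
for b in range(n_boot):
    # New dataset of the same length, drawn with replacement.
    sample = rng.choice(observed, size=observed.size, replace=True)
    boot_estimates[b] = sample.mean()   # stand-in for refitting the model

# Summarize the bootstrap distribution of the parameter estimate.
p5, p50, p95 = np.percentile(boot_estimates, [5, 50, 95])
print(round(p50, 3), (round(p5, 3), round(p95, 3)))
```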
P14L47: "Before training, the hyperparameters (number of layers, number of nodes in
hidden layer) were optimized via grid search" It's not clear how optimization can take
place prior to training. Please clarify in the text here. The description of the overall ANN
approach is not clear here. Were separate ANNs fit for each subject or combined
(naively?)? Why was the fold size selected to be 2? What metric was used for validation
(mean square error)?
We apologize for the lack of clarity. We have clarified these methods (P14L47) per
the reviewer’s request as follows:
“The cross-validation setup consisted of 22 data points and 11 folds. During each fold,
18 points were used for training, 2 for validation, and 2 for testing. In this way each
data point appears in a test fold exactly once, such that we can accumulate a test error
that accounts for all of the data points. During each fold, the validation set was used to
select the best model (i.e., training was terminated once the validation error did not
improve further).”
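The fold bookkeeping described above can be sketched as follows (index rotation only; the actual model fitting and early stopping are omitted):

```python
import numpy as np

n_points, fold_size = 22, 2
n_folds = n_points // fold_size          # 11 folds
indices = np.arange(n_points)
test_seen = []

for k in range(n_folds):
    # Rotate the index array by 2 each fold: 2 test, 2 validation,
    # and the remaining 18 for training.
    rolled = np.roll(indices, -k * fold_size)
    test, val, train = rolled[:2], rolled[2:4], rolled[4:]
    assert train.size == 18 and val.size == 2 and test.size == 2
    test_seen.extend(test.tolist())
    # (fit on `train`, early-stop on `val`, accumulate error on `test`)

# Each data point appears in a test fold exactly once.
print(sorted(test_seen) == list(range(n_points)))
```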
The ANN model can optimize either the MSE or the Relative Error (with some difficulty
for the latter as the derivative can be somewhat ill-conditioned). When reporting a
specific error we use an ANN model optimized for that error, and compute that error on
the validation set to select the best model.
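For concreteness, the two error metrics might be computed as below. The relative-error form shown is one common definition and an assumption on our part, as are the concentration values.

```python
import numpy as np

# Hypothetical observed and predicted concentrations.
obs = np.array([3.2, 2.8, 2.4, 2.0])
pred = np.array([3.5, 3.0, 2.6, 2.1])

mse = float(np.mean((pred - obs) ** 2))
rel_err = float(np.mean(np.abs(pred - obs) / obs))  # one common definition
print(round(mse, 4), round(rel_err, 4))
```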
P15L24: Please clarify how propofol dose differed from propofol administration rate.
Was it cumulative?
Propofol dose was the absolute amount (in milligrams) of propofol that was given to the
patient. The administration rate was the rate that was programmed into the infusion
pump.
We have clarified P15L24 to read: “For the ANN, the input layer consisted of 9 nodes
for each time step (gender, LBW, TBW, age, total propofol dose administered (mg),
rate of propofol administration (mg/min)), 2 hidden layers (10 nodes per layer), and 47
outputs, each node representing propofol concentration at a distinct time point.”
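As an illustration of the GRU mechanics (not the manuscript's actual TensorFlow model, which is available at the linked URL), a single numpy GRU step with 9 inputs and 10 hidden nodes might look like the sketch below; the weight initialization and input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, params):
    """One GRU update: gates control how much of the previous hidden
    state is kept versus replaced by the new candidate state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x @ Wz + h @ Uz + bz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)   # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(3)
n_in, n_hid = 9, 10        # 9 input features, 10 nodes per hidden layer
shapes = [(n_in, n_hid), (n_hid, n_hid), (n_hid,)] * 3
params = tuple(rng.normal(0.0, 0.1, s) for s in shapes)

h = np.zeros(n_hid)                 # initial hidden state
x = rng.normal(size=n_in)           # one time step's input vector
h = gru_step(x, h, params)
print(h.shape)
```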
P15L42: 'absolute difference': Was the mean absolute difference used as the validation
metric? If so, why use something other than mean square error?
The mean square error was used as the validation error metric. We apologize for this
error and the word “absolute” has been omitted and P15L42 has been changed to
read: “--defined as the mean square error (Equation 4).”
P18L35: "under-prediction" This seems contradictory with the reported positive mean
prediction error for both of these models. Please explain.
We apologize for this error and agree with the reviewer that the mean prediction error
is positive. Figures 2A (Panels B and D) clearly show an observed-to-predicted ratio
that is less than 1, indicating that the predictions are higher than the observed values.
We have corrected P18L35 to read “over-prediction”.
P18L42: "over-predicted" It's not clear how to reconcile this with the previous
sentences.
We thank the reviewer for pointing out this inconsistency. The over-prediction stated
at P18L42 is correct; we have corrected the earlier, contradictory statement (as
above).