Datasets
Two datasets, one for model building and one for validation, were constructed from a search of the Utrecht University Veterinary Diagnostic Laboratory patient files. The modeling dataset contained all dogs found to have a Babesia parasitemia from 2002 to 2013. A total of 87 dogs with Babesia parasitemia, confirmed by blood smear analysis, were enrolled. Data of control dogs (n = 1144) were collected from November 2010 through January 2011. All parameters were measured in only 63 dogs with Babesia and 294 control dogs; in the remaining dogs, no reticulocyte parameters were measured.
The validation dataset contained 13 dogs that tested positive for B. canis from 2017 through June 2020. Data from control dogs (n = 5649, comprising 5540 unique patients) were collected from January 2017 through September 2018. Dogs in which Anaplasma phagocytophilum was detected (n = 29) between 2017 and June 2020 were also included in this control group.
For all dogs in both datasets, 214 different ADVIA parameters related to erythrocytes, reticulocytes, platelets and leukocytes were recorded. In the 2002–2013 period, blood was analyzed with the ADVIA 120; in the 2017–2020 period, with the ADVIA 2120i.
Building a model with conventional statistics
As a first step, the mean and the ranges within 1 and 2 standard deviations of the mean (1 SD, 2 SD) were calculated for each feature in the modeling dataset, separately for the Babesia group and the control group. Next, the parameters were identified for which the largest percentage of Babesia dogs fell outside the mean ± 1 SD or mean ± 2 SD range of the control dogs. Following this, the area under the curve (AUC), optimal cutoff value, sensitivity, specificity and positive likelihood ratio (LR+) were calculated for each of these parameters using receiver operating characteristic (ROC) curves. Then, the parameters with an AUC > 0.70 were combined into one model to increase diagnostic accuracy, and the sensitivity, specificity and LR+ were calculated for each of these combinations. Finally, the model was applied to the validation dataset.
A commercially available software package (SPSS 27.0, IBM SPSS Statistics for Windows, Armonk, NY, USA) was used for data analysis. ROC curves were constructed and the AUC was calculated with commercially available MedCalc® Statistical Software version 20.009 (MedCalc Software Ltd, Ostend, Belgium).
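Although the published ROC analysis was carried out in SPSS and MedCalc, the sketch below illustrates the same per-parameter quantities (AUC, optimal cutoff, sensitivity, specificity and LR+) in Python. The labels and parameter values are made up, and the Youden index is used here only as one common cutoff criterion; it is not restated from the original analysis.

```python
# Illustrative only: ROC-based metrics for one hypothetical ADVIA parameter.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # 1 = Babesia, 0 = control (toy labels)
values = np.array([1.2, 1.5, 1.1, 1.8, 2.9, 3.4, 2.2, 3.1])   # toy parameter values

auc = roc_auc_score(y_true, values)
fpr, tpr, thresholds = roc_curve(y_true, values)

best = np.argmax(tpr - fpr)                                    # Youden index as the cutoff criterion
cutoff = thresholds[best]
sensitivity = tpr[best]
specificity = 1 - fpr[best]
lr_plus = sensitivity / (1 - specificity) if specificity < 1 else float("inf")

print(f"AUC={auc:.2f}, cutoff={cutoff}, Se={sensitivity:.2f}, Sp={specificity:.2f}, LR+={lr_plus}")
```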
Machine learning
A classification model was trained on a supervised learning task, \(f\left(\overrightarrow{x}\right)= y.\) Here, \(f\) represents the model, \(y\) is the (binary) label indicating whether the subject has a B. canis infection, and \(\overrightarrow{x}\) are the input features (i.e. a selection of ADVIA parameters). Several different classifiers—logistic regression, decision tree, random forest and XGBoost [17]—were trained. The tree-based models (decision tree, random forest and XGBoost) can capture non-linear relations in the data.
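As a minimal illustration of the classifier families listed above, the sketch below instantiates each with default settings using scikit-learn and XGBoost; the hyperparameters actually used in the study were tuned as described further on.

```python
# Sketch: the four classifier families, instantiated with default settings.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(),
}
```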
All selected ADVIA parameters were used in the tree-based classifiers without any further preprocessing. For the logistic regression, the data were first scaled by subtracting the mean and dividing by the standard deviation (i.e. a standard scaler), after which the K best predictors, identified by univariate feature selection, were used as input features. Here, K was treated as a hyperparameter. A hyperparameter is a configuration parameter of the model that is not directly learned from the data, as opposed to model parameters such as the coefficients of a logistic regression.
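A minimal sketch of this preprocessing for the logistic regression, assuming a scikit-learn pipeline (the exact implementation may differ): standard scaling followed by univariate selection of the K best features, with K supplied as a tunable parameter.

```python
# Sketch: logistic regression with standard scaling and univariate selection of K features.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def make_logreg_pipeline(k):
    return Pipeline([
        ("scaler", StandardScaler()),                          # subtract mean, divide by SD
        ("select", SelectKBest(score_func=f_classif, k=k)),    # keep the K best predictors
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```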
The training (validation) set contained 1144 (5649) negative samples and 87 (13) positive samples (see the Datasets section). First, ten-fold cross-validation on the training data was applied to tune hyperparameters and estimate out-of-sample performance. In this procedure, the training data were split into ten folds; a model was trained on nine folds, its performance was assessed on the unseen tenth fold, and this was repeated ten times to derive performance metrics on the training data. Hyperparameter tuning, i.e. optimizing the configuration parameters of each classifier, such as the maximum depth of the tree-based classifiers, was done using HyperOpt [18]. HyperOpt uses Bayesian optimization to efficiently propose new hyperparameter values based on their expected performance, which here was measured by the AUC. All experiments were logged in MLflow, and optimal hyperparameters were chosen for each classifier based on the AUC obtained from the cross-validation procedure. Results presented for the training data are for the optimal set of hyperparameters of each classifier. Next, each classifier with its chosen hyperparameters was trained on the full training data and then evaluated on the validation data to assess the generalization error. An overview of the workflow is provided in Fig. 1.
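The sketch below shows one way such a tuning loop can be set up with HyperOpt, using the ten-fold cross-validated AUC as the objective. The search space, trial count and the variables X_train and y_train are illustrative assumptions, and MLflow logging is omitted for brevity.

```python
# Illustrative tuning loop: 10-fold cross-validated AUC as the HyperOpt objective.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

space = {                                                       # assumed, simplified search space
    "max_depth": hp.choice("max_depth", [2, 3, 4, 5, 6]),
    "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
}

def objective(params):
    model = XGBClassifier(**params)
    # X_train, y_train are assumed to hold the ADVIA features and binary labels
    auc = cross_val_score(model, X_train, y_train, cv=10, scoring="roc_auc").mean()
    return -auc                                                 # HyperOpt minimizes, so negate the AUC

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())
```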
As the threshold for positive predictions, we used the value corresponding to 95% sensitivity as extracted from the ROC curve; 95% confidence intervals (CI) on the accuracy, sensitivity and specificity were computed by bootstrapping the data 1000 times. Bootstrapping results are not presented for the sensitivity on the validation data, since it contains only a limited number of positive samples.
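A sketch of the thresholding and bootstrap procedure, assuming arrays y_true (labels) and y_score (predicted probabilities) for one classifier; the exact resampling details of the original analysis are not restated here.

```python
# Sketch: threshold at 95% sensitivity, then bootstrap 1000 resamples for the accuracy CI.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)
threshold = thresholds[np.argmax(tpr >= 0.95)]                  # first cutoff reaching 95% sensitivity

rng = np.random.default_rng(0)
accuracies = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))             # resample with replacement
    y_b = np.asarray(y_true)[idx]
    pred_b = np.asarray(y_score)[idx] >= threshold
    accuracies.append((y_b == pred_b).mean())
ci_low, ci_high = np.percentile(accuracies, [2.5, 97.5])        # 95% CI on accuracy
```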
All data preprocessing, model training and evaluation were performed in Python 3.7 using the packages MLflow 1.11.0, NumPy 1.18.2, pandas 1.0.3, scikit-learn 0.22.2, shap 0.39.0 [19] and XGBoost 1.2.0.
Methods and results are reported in accordance with “MINimum Information for Medical AI Reporting” (MINIMAR) [20], a recently proposed standard for medical artificial intelligence (AI) reporting. Analysis code for the machine learning models is made publicly available via GitHub.