Prediction of Allogeneic Hematopoietic Stem-Cell Transplantation Mortality 100 Days After Transplantation Using a Machine Learning Algorithm: A European Group for Blood and Marrow Transplantation Acute Leukemia Working Party Retrospective Data Mining Study

Purpose
Allogeneic hematopoietic stem-cell transplantation (HSCT) is potentially curative for acute leukemia (AL), but carries considerable risk. Machine learning algorithms, which are part of the data mining (DM) approach, may serve for transplantation-related mortality risk prediction.

Patients and Methods
This work is a retrospective DM study on a cohort of 28,236 adult HSCT recipients from the AL registry of the European Group for Blood and Marrow Transplantation. The primary objective was prediction of overall mortality (OM) at 100 days after HSCT. Secondary objectives were estimation of nonrelapse mortality, leukemia-free survival, and overall survival at 2 years. Donor, recipient, and procedural characteristics were analyzed. The alternating decision tree machine learning algorithm was applied for model development on 70% of the data set and validated on the remaining data.

Results
OM prevalence at day 100 was 13.9% (n = 3,936). Of the 20 variables considered, 10 were selected by the model for OM prediction, and several interactions were discovered. By using a logistic transformation function, the crude score was transformed into individual probabilities for 100-day OM (range, 3% to 68%). The model's discrimination for the primary objective performed better than the European Group for Blood and Marrow Transplantation score (area under the receiver operating characteristics curve, 0.701 v 0.646; P < .001). Calibration was excellent. Scores assigned were also predictive of secondary objectives.

Conclusion
The alternating decision tree model provides a robust tool for risk evaluation of patients with AL before HSCT, and is available online (http://bioinfo.lnx.biu.ac.il/~bondi/web1.html). It is presented as a continuous probabilistic score for the prediction of day 100 OM, extending prediction to 2 years. The DM method has proved useful for clinical prediction in HSCT.

J Clin Oncol 33:3144-3151. © 2015 by American Society of Clinical Oncology


INTRODUCTION
Allogeneic (allo) hematopoietic stem-cell transplantation (HSCT) is a potentially curative procedure for selected patients with hematologic disease. Despite a reduction in transplantation risk in recent years [1], morbidity and mortality remain substantial, making the decision of whom, how, and when to perform transplantation of great importance.
Numerous parameters affect transplantation-related risk. When indicated, clinical judgment often plays a key role in patient selection [2]. Risk scores for mortality prediction, such as the European Group for Blood and Marrow Transplantation (EBMT) risk score and the Hematopoietic Cell Transplant-Comorbidity Index (HCT-CI) [3][4][5], may aid decisions.][8][9] The development of large and complex registries [10], incorporating biologic and clinical data, and the need for improved prediction models generate the drive to apply machine learning (ML) algorithms for clinical predictions [11].

ML is a field in artificial intelligence stemming from computer sciences. The underlying paradigm does not start with a predefined model; rather, it lets the data create the model by detecting underlying patterns [11]. Thus, this approach avoids preassumptions about model types and variable interactions, and may complement standard statistical methods [12,13]. Different algorithms are used to produce a function, a model, which will fit the data and not the other way around. In such procedures, many variables and combinations thereof can be used, and models are developed on a training set and validated on a test (ie, validation) set [11].

ML algorithms are part of a wider approach, called data mining (DM), for analyzing large and complex data sets. Such algorithms have been used in various financial and technologic applications and are gradually entering clinical use [11]. DM is a multidisciplinary field seeking to discover knowledge in databases in a systematic and automatic process [14]. A primer on the DM method in HSCT has been published by Shouval et al [11].

The need for improved risk assessment of allo-HSCT and the potential benefits of the DM approach served as the rationale for undertaking the current study. We have applied such an approach on a large cohort of patients with acute leukemia (AL) to develop an ML-based prediction model of overall mortality (OM) 100 days after allo-HSCT. We then assessed the model's ability to predict outcomes at 2 years.

PATIENTS AND METHODS

Study Design and Outcomes
This was a retrospective, supervised learning DM study based on data reported to the Acute Leukemia Working Party registry of the EBMT. The EBMT is a voluntary working group of more than 500 transplantation centers that are required to report all consecutive HSCTs and follow-ups annually in a standardized manner. The registry is routinely audited. The study was approved by the Acute Leukemia Working Party.
The primary objective was prediction of OM 100 days after allo-HSCT. Secondary objectives were the estimation of overall survival (OS), nonrelapse mortality (NRM), leukemia-free survival (LFS), and relapse incidence at 2 years, according to the score predicted for day 100 OM.
All outcomes were measured from the time of allo-HSCT. Day 100 OM was defined as death from any cause before day 100; NRM was defined as death without previous relapse/progression; LFS was defined as survival without leukemia progression or relapse; and relapse was defined as leukemia recurrence at any site. Cumulative incidence functions were used to estimate 2-year NRM and relapse after transplantation, taking into account the competition between these two events [15]. Probabilities of OS and LFS at 2 years were calculated using the Kaplan-Meier estimate [16]. Patients were censored at the time of the last follow-up.
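As a rough illustration of the Kaplan-Meier estimate used for OS and LFS, the following sketch computes the product-limit survival curve from follow-up times and event indicators in plain Python. The function name `kaplan_meier` and the toy data are illustrative assumptions and are not taken from the registry or from the software used in the study; the competing-risk cumulative incidence functions used for NRM and relapse are not covered here.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time per patient (any consistent unit)
    events -- 1 if the event (eg, death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)            # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            survival *= 1.0 - d / at_risk   # product-limit step
            curve.append((t, survival))
        at_risk -= exits[t]           # remove deaths and censored cases after t
    return curve

# Toy data (hypothetical): months of follow-up and event indicators.
times = [2, 3, 3, 5, 8, 8, 12, 24]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
```

Censored patients contribute to the risk set up to their last follow-up and then leave it without lowering the survival estimate, which is the behavior the censoring rule above relies on.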

Population and Variables
Per protocol, inclusion criteria encompassed first allogeneic transplantations, performed from 2000 to 2011, using peripheral blood stem cells or bone marrow as the cell source, in adults age ≥ 18 years diagnosed with de novo AL. Haploidentical transplantations were excluded.
A total of 29,685 patients from 404 European centers were initially analyzed. Patients lost to follow-up before day 100 after HSCT were excluded from analysis (n = 1,449, 5%; Data Supplement). Twenty variables describing recipient, donor, and procedural characteristics were considered. Variables were defined according to EBMT criteria [17] and are detailed in Table 1 and the Data Supplement.

Alternating Decision Tree
The alternating decision tree (ADT) is an ML algorithm designed for prediction. It generates alternating levels of prediction and decision nodes, denoted as ellipses and rectangles, respectively. Each prediction node is associated with a weight, representing its contribution to the final prediction score, whereas each decision node contains a splitting attribute (ie, variable). The tree is formed through an iterative process. The iteration number in which a decision node was introduced to the tree is an arbitrary measure of its importance as a decision rule (ie, lower iterations correspond to higher importance) [18]. The first level of decision nodes represents independent variables, whereas daughter decision nodes are dependent on previous decisions.
Prediction with ADT involves pursuing multiple paths, corresponding with the instance features, with the same variable possibly playing multiple roles in different places along the tree. To calculate the score, one starts at the root and proceeds along multiple paths down the tree, according to the following rules: If the node is a prediction node, proceed along all of the dotted edges emanating from it; if the node is a decision node, proceed along the edge corresponding to the instance characteristics (Fig 1). The cumulative score gathered by an instance (ie, patient) is the sum of the prediction values along all paths that the patient traverses in the decision tree. A positive score implies membership of one class, and a negative score, membership of the other class. Higher absolute scores are associated with higher probability of a certain binary outcome (ie, day 100 OM). In the current study, we did not choose a threshold for classification, but used the cumulative score as a continuous probabilistic measure for classification.0][21] For a detailed description of the algorithm, see Freund and Mason [18] and the Data Supplement.
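The traversal rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WEKA implementation used in the study: the class names, the toy tree, its splitting rules, and all weights are hypothetical and do not reproduce the published model, and missing values are not handled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PredictionNode:
    weight: float                                   # contribution to the score
    splitters: List["DecisionNode"] = field(default_factory=list)

@dataclass
class DecisionNode:
    test: Callable[[Dict], bool]                    # splitting rule on one variable
    if_true: PredictionNode
    if_false: PredictionNode

def adt_score(node: PredictionNode, patient: Dict) -> float:
    """Sum prediction weights over every path the patient traverses."""
    total = node.weight
    for splitter in node.splitters:                 # prediction node: follow all edges
        branch = splitter.if_true if splitter.test(patient) else splitter.if_false
        total += adt_score(branch, patient)         # decision node: follow one edge
    return total

# Hypothetical toy tree: weights and cutoffs are made up for illustration.
root = PredictionNode(0.10, [
    DecisionNode(lambda p: p["karnofsky"] >= 80,
                 PredictionNode(-0.20), PredictionNode(0.30)),
    DecisionNode(lambda p: p["stage"] == "CR1",
                 PredictionNode(-0.10, [
                     DecisionNode(lambda p: p["age"] >= 46,
                                  PredictionNode(0.15), PredictionNode(-0.05)),
                 ]),
                 PredictionNode(0.20)),
])

patient = {"karnofsky": 90, "stage": "CR1", "age": 50}
score = adt_score(root, patient)   # 0.10 - 0.20 - 0.10 + 0.15, about -0.05
```

Note that the age rule is reached only for patients in CR1, which is how a tree of this shape expresses an interaction between two variables.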

Model Development and Validation
The ADT algorithm was applied for prediction model development. The study cohort was randomly divided into training (n = 19,765; 70%) and validation (n = 8,471; 30%) data sets. The algorithm was trained and tested using 10-fold cross validation on the training data set and validated on the validation set (Data Supplement). In addition, a separate Cox regression model, including the variables selected by the ADT model, was simultaneously developed on the training set for prediction of day 100 OM and then validated using the validation set. Software packages used were WEKA (version 3-7-9; http://www.cs.waikato.ac.nz/ml/weka/), SPSS 19 (http://www-01.ibm.com/software/analytics/spss/), and R version 3.0.1 (http://www.r-project.org/). For personalized score calculation, an online interface was constructed (http://bioinfo.lnx.biu.ac.il/~bondi/web1.html).
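The partitioning scheme described above (a random 70%/30% split followed by 10-fold cross validation within the training set) might be sketched as follows. The helper names, the fixed seed, and the interleaved fold assignment are assumptions made for illustration; the study's actual randomization procedure is described only in its Data Supplement.

```python
import random

def split_train_validation(n_patients, train_fraction=0.7, seed=0):
    """Randomly partition patient indices into training and validation sets."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)        # seed is an arbitrary assumption
    n_train = int(n_patients * train_fraction)
    return indices[:n_train], indices[n_train:]

def k_folds(indices, k=10):
    """Yield (fit, held_out) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]   # k near-equal, disjoint folds
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out

train, validation = split_train_validation(28236)
folds = list(k_folds(train, k=10))
```

With 28,236 patients this split yields set sizes that match those reported above; the validation indices are never touched during cross validation.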

Predictive Performance and Comparison With the EBMT Score
To transform the crude score into individual probabilities of day 100 OM, the training set was calibrated by entering the crude score as a covariate in a logistic regression model, with the dependent variable being day 100 OM. The quality of the score after calibration was evaluated through a reliability diagram, which verified that in each score interval, defined according to the deciles, the mean score was consistent with the observed proportions of events [22,23]. The prediction model's discrimination was assessed using the area under the receiver operating characteristics curve (AUC). AUCs were computed as time-dependent receiver operating characteristic curves, and comparisons were performed by the time receiver operating characteristic software [24].
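The calibration step can be illustrated with a short sketch: a logistic function maps the crude score to a probability, and a decile-based table compares mean predicted probabilities with observed event proportions, which is the information a reliability diagram plots. The function names are hypothetical, and the intercept and slope are placeholders supplied by the caller; the fitted coefficients of the published model are not given in the text and are not reproduced here.

```python
import math

def to_probability(score, intercept, slope):
    """Logistic transformation of a crude score into a probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))

def reliability_table(probabilities, outcomes, n_bins=10):
    """Sort cases by predicted probability, cut them into equal-sized bins,
    and return (mean predicted, observed event proportion) per bin."""
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue                            # fewer cases than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows

# Toy check (hypothetical data): two groups whose event rates match predictions.
probs = [0.1] * 10 + [0.9] * 10
events = [1] + [0] * 9 + [1] * 9 + [0]
rows = reliability_table(probs, events, n_bins=2)
```

A well-calibrated score gives rows in which the two numbers are close in every bin, so the points of the reliability diagram lie near the diagonal.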

RESULTS

Patient Characteristics
The characteristics of 28,236 analyzed patients are listed in Table 1. The median follow-up time was 45 months. Most patients had acute myeloid leukemia (70%), were in first complete remission (CR1; 60%), and received myeloablative conditioning (MAC; 71.5%).
Grafts from matched sibling donors were used in 53.9% of patients. The graft source was mainly peripheral blood (78%). OM and NRM prevalence at day 100 were 13.9% (n = 3,936) and 10.4% (n = 2,928), respectively. Relapse incidence before 100 days was 9.6% (n = 2,714). Infection and graft-versus-host disease were the leading causes of day 100 NRM (Data Supplement). The training and validation data sets were similar in terms of baseline variables, except for donor's sex and recipient-donor sex combination (Data Supplement).

ADT Model Output
On the basis of the training set, a prediction model for day 100 OM was developed. We applied the ADT algorithm on the training set and optimized parameters (Data Supplement) through 10-fold cross validation. Figure 1 depicts the graphical output of the ADT prediction model. The ADT algorithm selected 10 of 20 variables (Fig 1 and Table 2). Independent variables for the primary objective were disease stage, Karnofsky performance score, donor type, recipient-donor cytomegalovirus (CMV) serostatus, and HSCT year, whereas age, diagnosis, days from diagnosis to transplantation, conditioning regimen, and annual number of transplantations were dependent variables.
Selected interactions discovered by the tree include the following: patients with acute myeloid leukemia who received transplantation in second complete remission (CR2) had a lower risk of OM when compared with patients with acute lymphoblastic leukemia who received transplantation in the same stage (prediction node weight, -0.074 and 0.152, respectively); a shorter duration (< 142 days) between diagnosis and transplantation in CR1 or advanced-stage patients was associated with lower OM (node weight, -0.144). However, this effect was abrogated in patients age 46 years or older (node weight, 0.148). In the same disease stage categories (CR1 and advanced), older patients (age ≥ 37 years) receiving reduced-intensity conditioning had lower OM risk (node weight, -0.144) when compared with those receiving MAC. In transplantations from matched unrelated donors, center experience (≥ 20 transplantations/year) positively affected outcomes.
The year range of transplantation was incorporated in the prediction model, because HSCTs performed after 2003 were associated with lower day 100 OM rates. Nevertheless, it was not entered in the online user interface, because the model's aim is prospective outcome prediction. Thus, the year range is predefined on the Web site as 2004 and later.

‡HLA allelic level compatibility: 10 of 10 (n = 4,619), 9 of 10 (n = 2,099), < 9 of 10 (n = 1,363), and missing (n = 4,947).

Fig 1. Alternating decision tree (ADT) prediction model for overall mortality (OM) at day 100. The ADT consists of alternating levels of prediction nodes (ellipses) and decision nodes (rectangles). Each prediction node is associated with a weight, representing its contribution to the cumulative prediction score, whereas each decision node contains a splitting attribute. The iteration number in which the decision node was introduced is described by the number on the left side of the decision node and is inversely correlated with predictive influence. Variables are not mutually exclusive. Patients traverse the tree according to their features (ie, variable values), and the cumulative score is calculated. For example, consider a patient with the following features (plotted as black arrows on the tree): received transplantation in CR2, a Karnofsky performance score of 90, diagnosed with acute myeloid leukemia, received transplantation from a matched unrelated donor (MUD) in 2011, and both recipient and donor cytomegalovirus (CMV) seronegative. The cumulative prediction score for this patient is -0.099 (0.065 - 0.178 - 0.057 + 0.236 - 0.074 + 0.064 - 0.035 - 0.12). The score is transformed into an individualized probability of day 100 OM (10.7% for the above example) and provides the output for the online user interface. Only 6 of the 10 variables included in the model were necessary for score calculation in this patient. No. of annual hematopoietic stem-cell transplantations (HSCTs) represents the No. of annual allogeneic HSCTs performed in the individual center in the year the transplantation was performed. AML, acute myeloid leukemia; CR, complete remission; D, donor; Disease st., disease stage; dx, diagnosis; neg, negative; PS, performance score; R, recipient; RIC, reduced-intensity conditioning.
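The arithmetic of the worked example in the Fig 1 legend can be checked directly. The snippet below only sums the prediction-node weights quoted in the legend; the variable names are illustrative, and the subsequent conversion to a probability is not reproduced because the calibration coefficients are not reported in the text.

```python
# Prediction-node weights quoted in the Fig 1 example, in the order given.
weights = [0.065, -0.178, -0.057, 0.236, -0.074, 0.064, -0.035, -0.12]

# Cumulative ADT score: the sum of all weights on the traversed paths.
score = round(sum(weights), 3)
```

The rounded sum agrees with the cumulative score of -0.099 stated in the legend.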

Prediction of Day 100 OM
Before calibration, individual patient scores ranged from -0.812 to 1.389. After calibrating the validation set, day 100 OM probabilities ranged from 3% to 68%. Consistency between predicted and observed probabilities for the primary objective was excellent (Fig 2).
The ADT model's discrimination for the primary objective outperformed the EBMT score (AUC, 0.702 v 0.646; P = 3 × 10^-18). Predictive performance of the Cox model (Data Supplement), when evaluated in the subset of patients with available information on all 10 included variables, did not differ from the ADT model (AUC, 0.693 v 0.697; P = .38; Data Supplement).
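For readers less familiar with the AUC, the sketch below computes it for a binary outcome as the probability that a randomly chosen patient with the event receives a higher score than a randomly chosen patient without it. This is the simple, non-time-dependent form and is a stand-in for illustration only; the study itself used time-dependent receiver operating characteristic curves, and the function name and toy data are assumptions.

```python
def auc(scores, outcomes):
    """Fraction of (event, non-event) pairs ranked correctly by the score;
    tied pairs count one half."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data (hypothetical): predicted risks and observed day 100 deaths.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_outcomes = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_outcomes)   # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to chance ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of patients, so a rank-based formulation would be preferable for a cohort of this size.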

Prediction of Long-Term Outcomes
Probabilities of 2-year outcomes in each score interval for secondary objectives are summarized in Table 3 and Figure 3. Cumulative incidence of 2-year NRM was 38.2% (95% CI, 34.7% to 41.7%) for patients included in the highest score interval, with corresponding Kaplan-Meier estimates of OS and LFS of 19.9% (95% CI, 17% to 22.9%) and 17.5% (95% CI, 14.7% to 20.3%), respectively. Probabilities of 2-year NRM, OS, and LFS, for patients in the lowest score interval, were 9.8% (95% CI, 7.9% to 12%), 72% (95% CI, 68.8% to 75.1%), and 64.9% (95% CI, 61.6% to 68.2%), respectively. Relapse incidence was not predicted by the score. Discrimination of the ADT model for 2-year OS outperformed the EBMT score and did not differ when compared with the Cox model (Data Supplement).

DISCUSSION
Eligibility of patients with AL for allo-HSCT is based on a risk-benefit assessment of the relapse risk versus transplantation risk [25]. By applying the ADT algorithm, we have developed a novel prediction model on the basis of 10 variables for day 100 OM. Scores correlated with objectives, enabling an individual continuous probabilistic evaluation of the primary objective (ie, OM at day 100) and a discretized risk assessment of secondary objectives at 2 years (OS, NRM, and LFS).
Insights can be derived from the tree-like structure of the model and variable weights (Fig 1 and Table 2) [26,27]. Earlier years (2000 to 2003) were associated with a worse outcome, reflecting advances in the field [1]. An advantage of the ADT is its ability to detect interactions. For instance, the effect of the interval between diagnosis and transplantation, with a cutoff of 142 days, had impact only for certain disease stages (ie, CR1 and advanced). Thus, specific characteristics of unique subpopulations were captured, and the cutoff set by the EBMT score of 1 year for all disease stages was refined [3]. 9][30] Not surprisingly, reduced-intensity conditioning was a favorable prognostic factor when compared with MAC in older patients (age ≥ 37 years), corroborating interactions between age and conditioning. Interestingly, age was not an independent risk variable. It seems that transplantation practice and patient selection have reduced the importance of age with respect to outcome [31].

The ADT algorithm was able to detect variables associated with the primary outcome, assign weights, and ignore redundancies (eg, the recipient-donor CMV serostatus combination was selected, whereas individual CMV status, donor or recipient, was excluded). Body mass index and cytogenetics may play a role as prognostic factors [5,32], but were not selected, possibly because of many missing values. Transplantation from a female donor to a male recipient has also been associated with mortality in previous studies [3], but was not selected in the current study, because it mainly affects late mortality. Differences in variable selection compared with previous allo-HSCT prognostic studies probably reflect different measures of predictive importance assessment. Models augment, rather than contradict, one another. Their integration may lead to improved predictive accuracy.
The EBMT score is a well-recognized tool for adjusting transplantation analysis. The ADT model showed an improvement in discrimination, although relatively small, over the EBMT score (AUC, 0.701 v 0.646; P < .001). Nevertheless, one must keep in mind that the EBMT score was designed for prediction of long-term survival; thus, comparison with our score is not trivial, because primary end points differ.[5]7 Moreover, when contemplating a transplantation, one must take into account specific patient history (eg, the interval from diagnosis to transplantation does not have the same impact in CR1 or CR2) [33]. Such interactions are not necessarily captured by standard statistical models.
Stratifying OM risk at 100 days by collapsing score intervals into different risk groups would lead to loss of important clinical information. By providing a continuous measure for patient risk, we transformed the prediction problem from a classification task to a regression task, and we enhanced physician and patient understanding regarding expected transplantation hazard. Two-year outcomes, which have previously been shown to predict long-term survival [34], were estimated according to the score (by deciles) established for day 100 OM. Thus, potential use was extended, and factors predicting day 100 OM may be surrogates for long-term survival.
The ADT is a classification algorithm designed for handling binary end points, but not censored or continuous end points. Therefore, we focused on a short-term outcome, in a population in which loss to follow-up was lower than 5% and center effect is unlikely, because transplantation volume was not linked to patient loss (Data Supplement). Patients lost had some differing characteristics (Data Supplement); however, given their relatively small number, they are not likely to affect model performance. An important aspect of the current study is the introduction of an alternative approach for prediction model development, rather than comparison with the conventional approach. DM methods have been applied in fields such as communication and finance, and one can think of potential uses in HSCT, because they allow prediction of the outcome of interest without strong assumptions regarding the distribution of the variables and the regression model used [11,12]. It is reassuring that the Cox and ADT models achieved similar discrimination, stressing the validity of the DM method with short-term transplantation data. Nevertheless, alternative methods should be explored for modeling long-term outcomes [35,36].

This study has several limitations. First, it is a retrospective analysis susceptible to data selection and measurement biases [37]. However, the registry analyzed reflects real world data, conveying contemporary practice [10]. Second, validation was done on an internal data set, and external validation is warranted. Nonetheless, the large number of patients in the analysis, the use of 10-fold cross validation for training in addition to a separate validation set, and the model's excellent calibration all greatly enhance validity and robustness. Moreover, despite the lack of prediction model development guidelines, we adhered to strict methodologic principles [23]. Third, in contrast to the EBMT and HCT-CI scores, which are not disease specific, our score applies only to ALs, which are a leading indication for allo-HSCT [1]; thus, targeting this patient population is reasonable. Still, the diagnosis variable had low predictive influence, suggesting that the score may be applicable to other diseases. Fourth, given the ADT model complexity, calculation of patient score is nontrivial, as opposed to the EBMT score [3]. Therefore, we provided an online interface (http://bioinfo.lnx.biu.ac.il/~bondi/web1.html) to enable easy calculation. Finally, the primary objective focused on short-term survival. Nevertheless, our model showed competence in predicting NRM, OM, and LFS at 2 years. In addition, the high rate of day 100 OM (13.9%) highlights its importance as a valid objective.
In conclusion, we present a machine learning-based prediction model for mortality after allo-HSCT. The model was developed using a DM approach and internally validated on a large data set with excellent calibration. It can be readily used online and provides a personalized estimation of day 100 OM risk and a discretized estimation of long-term outcomes, and at the same time reveals variable interactions. The model's potential applications include pretransplantation risk assessment and stratification, patient counseling during informed consent sessions, and tailoring transplantation regimens or referring to alternative treatments according to transplantation risk. Predictive accuracy is still not optimal. Integration with the HCT-CI score, detailed data on modifiable therapeutic factors, and data on somatic mutations (eg, Fms-like tyrosine kinase 3 and Nucleophosmin 1) may further enhance predictive power and aid treatment personalization. Having demonstrated that the DM approach can be applied to the EBMT registry data, future studies must aim to make more precise predictions for long-term outcomes using the recent methods developed to manage censored data [35,36].

Fig 3. Kaplan-Meier curves of overall survival stratified by the categorized alternating decision tree score. Higher scores (ie, higher score interval number) resulted in lower probability of survival. Calibrated score intervals are described (see Table 3).
Fig 2. The alternating decision tree score calibration plot. Mean predicted probability of overall mortality at day 100 for each categorized score interval was plotted against observed proportions of events. R, correlation coefficient.

Table 1. Patient Characteristics

Table 3. Score Intervals and Associated Outcomes on the Validation Set