Prediction of acute myeloid leukaemia risk in healthy individuals

The incidence of acute myeloid leukaemia (AML) increases with age and mortality exceeds 90% when diagnosed after age 65. Most cases arise without a detectable prodrome and present with the acute complications of bone marrow failure1. The onset of such de novo AML cases is typically preceded by the accumulation of somatic mutations in pre-leukaemic haematopoietic stem and progenitor cells (HSPC) that undergo clonal expansion2,3. However, recurrent AML mutations also accumulate in HSPCs during ageing of healthy individuals who do not develop AML, a phenomenon referred to as age-related clonal haematopoiesis (ARCH),4–8. To distinguish individuals at high risk of developing AML from those with benign ARCH, we undertook deep sequencing of genes recurrently mutated in AML in the peripheral blood cells of 95 individuals sampled on average 6.3 years before AML diagnosis (pre-AML group), together with 414 unselected age- and gender-matched individuals (control group). Pre-AML cases were distinct from controls with more mutations per sample, higher variant allele frequencies (VAF) reflective of greater clonal expansion, and enrichment for mutations in specific genes. Genetic parameters were used to derive a model that accurately predicted AML-free survival; this model was validated in an independent cohort of 29 pre-AMLs and 262 controls. Since AML is rare, we also developed an AML predictive model using a large electronic health record database that identified individuals at greater risk. Collectively our findings provide a proof-of-concept that it is possible to discriminate ARCH from pre-AML many years prior to malignant transformation. This could in the future enable earlier detection, monitoring and potentially inform intervention.

(P<2.2 x 10 -16 , two-sided Fisher's exact test) (Fig. 1a), consistent with data from a study 131 of >2,000 unselected individuals assayed using a similarly sensitive method 9,10 . 132 Additionally, 39% of pre-AMLs above the age of 50 harboured a driver mutation with 133 VAF>10%, compared to only 4% of controls, a prevalence in keeping with the largest 134 studies of ARCH in the general population 4 (P<2.2 x 10 -16 , two-sided Fisher's exact test; 135 Extended Data Fig. 1). 136 The median number of ARCH-PD mutations per individual increased with age and was 137 significantly higher in the pre-AML group relative to controls (Fig. 1b, Supplementary  138 Table 2). Furthermore, examination of ARCH-PD VAF distribution revealed 139 significantly larger clones among the pre-AML cases (P=1.2 x 10 -13 , two-sided Wilcoxon 140 rank sum test, Fig. 1c). To gain insight into clonal growth dynamics, we examined 141 serially collected samples available for a subset of the VC. We did not find significant 142 differences in clonal expansion rates between pre-AMLs and controls (Extended Data 143 Fig. 2 a,b), although this may in part reflect the shorter follow-up of pre-AMLs, small 144 sample size and large variance in growth rates (Extended Data Fig. 2c). The observed 145 differences between pre-AMLs and controls may arise through cell-intrinsic or -extrinsic 146 factors. Although these variables have not been studied adequately in ARCH, a number 147 of observations in different contexts such as aplasia, post-chemotherapy and advanced 148 age have shown that increased clonal fitness is associated with distinct mutations 149 depending on context 10-12 . Interestingly, mutations in splicing factor genes were 150 significantly enriched among the pre-AMLs relative to the controls (odds ratio, 17.5; 95% 151 CI,, P=5.2 x 10 -16 , two-sided Fisher's exact test) and presented in significantly 152 younger individuals (median age 60.3 vs 77.3 years, P=1.7 x 10 -4 , two-sided Wilcoxon 153 rank sum test, Fig. 2a). Previous work suggests that spliceosome mutations appear to 154 confer competitive advantage in the context of ageing 10 . Hence it is possible that the 155 significantly higher prevalence of such clones in younger pre-AMLs may reflect extrinsic 156 selection pressures rather than earlier mutation acquisition. 157 In keeping with previous reports 9 , we found that DNMT3A and TET2 were the most 158 commonly mutated genes in both groups (Fig. 2b). We could not identify any NPM1c or 159 FLT3-internal tandem duplication (ITD) mutations, consistent with these arising late in 160 leukaemogenesis 10,13 . Recurrent CEBPA mutations, which are implicated in ~10% of de 161 novo AML 14 , were also absent, suggesting that driver events in this gene may also be late 162 events in AML evolution. In order to quantify the impact of different mutations on the 163 likelihood of progression to AML, we ranked ARCH-PD mutations based on the number 164 of times they have been reported in COSMIC among individuals with haematological 165 malignancies 15 . We found that mutations that are highly recurrent in cancer specimens 166 were more common in pre-AMLs than in controls with ARCH-PD, whereas driver events 167 in the controls tended to affect loci that are less frequently mutated in haematological 168 malignancies and occurred at significantly lower VAF (Fig. 2c, Fig. 2d). Overall, these 169 findings demonstrate notable differences in the mutational landscape of ARCH and pre-170 AML. Moreover, this work, in conjunction with recent insights into the origins of AML 171 relapse 16 , suggests that AML progression typically occurs through many years of pre-172 leukaemic HSPC clonal evolution before acquisition of late mutations leads to overt 173 malignant transformation. 174 Based on these findings we next developed an approach to quantify the relative 175 contributions of driver mutations and clone sizes to the risk of progressing to AML. We 176 tested different regularised logistic and Cox proportional hazards regression approaches, 177 which achieved similar performance on both DC (Concordance, C = 0.77 ± 0.03) and VC 178 (C = 0.84 ± 0.06) (Extended Data Fig. 3 and Extended Data Fig. 4, Supplementary Table  179 3). Models trained on only DC or VC had similar coefficients, thereby justifying 180 combining the data sets for a more accurate analysis of the individual risk contributions 181 (C = 0.77 ± 0.03, AUC=0.79, Supplementary Table 3). Quantitatively, we found that 182 driver mutations in most genes conferred an approximately two-fold increased risk of 183 developing AML per 5% increase in clone size (Fig. 3a, Supplementary Table 3). Notable 184 exceptions to this trend are the most frequently mutated ARCH genes, DNMT3A and 185 TET2, which confer a lower risk of AML progression (Fig. 3a,b, Supplementary Table 3). 186 Conversely, a larger effect size was apparent for TP53 (hazard ratio HR=12.5, 95% CI 187 5.0-160.5) and U2AF1 (HR=7.9, 95% CI 4.1-192.2) mutations ( Fig. 3a,b). However, we 188 note that other ARCH-PD genes such as SRSF2 can contribute a similar relative risk due 189 to their presence at higher VAF in pre-AMLs (Fig. 3a,Extended Data Fig. 5a,190 Supplementary Note). Of note, mutations in TP53 and spliceosome genes (including 191 U2AF1) are also associated with a poorer prognosis in AML 14 . As the risk of each 192 ARCH-PD mutations is deleterious and the effect of multiple mutations present in the 193 same individual is multiplicative, a higher number of mutations is predicted to increase 194 AML progression risk (Fig. 3c). Likewise, the size of the largest driver clone was also 195 strongly associated with AML progression risk in agreement with the risk of individual 196 mutations generally being proportional to VAF (Fig. 3c). Collectively, whilst the VAF 197 and mutation number confer much of the predictive value, this model does demonstrate 198 distinct gene-level risk factors, and is able to quantify the cumulative impact of multiple 199 mutations and clonal size on the likelihood of progression to AML. 200 Although our predictive model performs well in identifying those at risk of developing 201 AML in our experimental cohorts, AML incidence rates in the general population are low 202 (4:100,000) 1 , and thus millions of individuals would need to be screened to identify the 203 few pre-AML cases, with many false positives. We therefore sought to determine 204 whether routinely available clinical information could improve prediction accuracy or 205 identify a high-risk population for targeted genetic screening. We first analysed complete 206 blood count (CBC) and biochemistry data available for 37 of the pre-AMLs and 262 207 controls. As reported previously 5,10,17 , ARCH-PD was overwhelmingly associated with 208 normal blood counts and this was also the case for pre-AML cases, indicating that these 209 did not represent undiagnosed myelodysplastic syndromes (MDS) 18 . We identified a 210 significant association between higher red cell distribution width (RDW) and risk of 211 progression to AML (P=0.0016, Wald Test with Bonferroni multiple testing correction, 212 Fig. 3d). Although traditionally used in the evaluation of anaemia, raised RDW has been 213 correlated with inflammation, ineffective erythropoiesis, cardiovascular disease and 214 adverse outcomes in several inflammatory and malignant conditions 19 . The correlation 215 between RDW and AML risk remained highly significant when controls without ARCH-216 PD were excluded from the analysis (P=3.5 x 10 -06 , Wald test with Bonferroni multiple 217 testing correction, Extended Data Fig. 5b). Higher RDW has previously been associated 218 with ARCH and overall mortality 5 , but has never been shown to distinguish ARCH from 219 pre-leukaemia. In order to verify RDW as a predictive factor and determine whether 220 additional clinical parameters are associated with AML risk, we studied the Clalit 221 database 20 , which contains the electronic health records (EHR) with an average of 3.45 222 million individuals per year collected over a 15-year period 21 . We identified 875 AML 223 cases using stringent criteria based on diagnostic codes and treatment records (Extended 224 Data Fig. 6, Supplementary Table 4). Analysis of RDW trends revealed significantly 225 raised measurements several years prior to AML diagnosis relative to age and sex-226 matched controls (Fig. 4a). Additional parameters that correlated with AML risk included 227 reductions in monocyte, platelet, red blood cell and white blood cell counts, albeit usually 228 remaining above the thresholds for clinically relevant cytopenias 18 (Fig. 4a, Extended 229 Data Fig. 7). These findings suggest that evolving de novo AML may sometimes have a 230 significant prodrome with subtle but discernible clinical manifestations. We next applied 231 a machine learning approach to construct an AML prediction model based entirely on 232 routinely documented EHR variables (Extended Data Fig. 8, Supplementary Table 4). 233 This model was able to predict AML 6-12 months prior to diagnosis with a sensitivity of 234 25.7% and overall specificity of 98.2%. The model performed consistently across 235 different age groups with an increased relative risk of 28 and 24 for males and females, 236 respectively, between the age of 60 and 70 years (Fig. 4b). To better understand which 237 patients are most likely to be accurately classified by this model, we compared absolute 238 lab values between true positives (TP) and false negatives (FN). We found that 35.5% of 239 FN predictions were for patients associated with infrequent blood count data (Extended 240 Data Fig. 9). Some of the TP cases had mildly abnormal blood counts that would not 241 initiate a diagnostic work-up (Fig. 4c), and cytopenias that would be compatible with 242 undiagnosed MDS 18 were uncommon. 243 Collectively, our findings provide new insights into the pre-clinical evolution of AML 244 and support the premise that individuals at high risk can be identified years before they 245 develop overt disease. To this end, we present two distinct models for the prediction of de 246 novo AML: one based on somatic point mutations and the other on routinely documented 247 clinical information. We find that basic clinical and laboratory data can identify a high-248 risk subgroup 6-12 months before AML presentation, whilst genetic information can 249 identify a substantial fraction of cases several years to more than a decade before 250 diagnosis. By characterising features that distinguish benign ARCH from pre-leukaemia, 251 our models give valuable insights into leukaemogenesis. It is evident from the current 252 study, together with our recent analysis of mutation acquisition from pre-leukaemic 253 development through to relapse 16 , that long-term pre-leukaemic HSPCs frequently carry 254 mutations and undergo significant clonal expansion whilst retaining differentiation 255 capacity for years before AML diagnosis. Furthermore, it is clear that some mutations, 256 particularly those affecting TP53 and U2AF1, impart a relatively high risk of subsequent 257 AML, while mutations in other genes, for example DNMT3A and TET2, confer a lesser 258 risk of malignant transformation. Previous studies suggest that oncogenic mutations in 259 TP53 and spliceosome genes confer little or no competitive advantage in the absence of 260 particular selective pressures 22,23 , indicating that cell-extrinsic factors may be important 261 determinants of clonal trajectory. 262 Cancer predictive models have enabled successful early detection and intervention 263 programmes for several solid tumours 24,25,26 . However, screening tests are unavailable for 264 the sub-clinical stages of most haematological malignancies. Our study provides proof-265 of-concept for the feasibility of early detection of healthy individuals at high risk of 266 developing AML, and is a first step in the design of future clinical studies to investigate 267 the potential benefits of early interventions in this deadly disease. However, the 268 infrequency of AML necessitates that future screening tests provide high sensitivity and 269 specificity. Our findings suggest that basic clinical data may identify a higher risk 270 population that might benefit from targeted genetic screening. Equally, combining 271 clinical and genetic information in a single model and including structural driver events is 272 likely to improve model accuracy further. Nevertheless, establishing the utility of such a 273 tandem approach will require extensive clinical and genetic analysis on the same 274 population cohort, in a prospective setting. Furthermore, ARCH is associated with several 275 non-malignant conditions 4,5 , and may play a causal role in cardiovascular disease 27,28 . 276 Hence genetic testing for ARCH may also prove useful in the management of common 277  Shlush laboratories for comments and Tom Hudson for early study planning. We thank 438 Gabi Barabash for organising the Clalit dataset collaboration. The EPIC study centers 439 were supported by the Hellenic Health Foundation, Regional Government of Asturias, the   individuals were selected for the control group, as they did not develop any 552 hematological disorders during the average follow-up period of 11.6 years (IQR=2.1 553 years). The median age at recruitment was 56.7 years (range, 36.08 to 74.42). In order to 554 minimize any possible demographic biases, an approximate 1:4.5 pre-AML to control 555 ratio was maintained across the different centres. 556

Validation cohort 557
Samples were ascertained from individuals enrolled in the EPIC-Norfolk longitudinal 558 cohort study between 1994 and 2010. Samples and clinical metadata were available from 559 37 AML patients (of which 8 were already included in the DC) and 262 age-and gender-560 matched controls without a history of cancer or any haematological condition. The 561 average time between the first blood sampling and AML diagnosis was 10.5 years (IQR 562 8.3 years). The averge follow-up period for the control cohort was 17.5 years (IQR 3.8 separation. The captured DNA on beads was resuspended in 40μl of Nuclease-Free water 598 before dividing the total volume into 2 PCR tubes and subjecting the libraries to 10 599 Both files were then analysed to detect SNVs and small indels using Varscan2 35 . In order 628 to further remove sequencing artifacts and improve sensitivity, we applied a two-step 629 polishing statistical approach that models the error rate at each sequenced genomic 630 position. For both steps, bam1 was used and all the samples except the sample being 631 investigated were included for error rate modelling. At step one, as previously 632 described 31 , the error rates were modelled by fitting weibull distribution curves to the 633 non-reference allele fractions. SNVs with allele fractions that were statistically 634 distinguishable from the background error rates (P=0) were further analysed. At Step 2, 635 the coverage of the non-reference allele fractions was considered by using linear line 636 fitting that describes the negative correlation that exist between the log (non-reference 637 allele fraction) and the corresponding log(coverage) values. This allowed us to estimate 638 different error rates at different coverage depths. As indel errors are rare and cannot be 639 appropriately modelled by the same statistical framework they were called using barcode 640 mediated error correction alone. At least 10 CR, 5 supporting reads on the forward strand, 641 5 supporting reads on the reverse strand, and 2 DR were required to call an indel. 642 Additional post-processing steps applied to data from both the DC and VC are detailed in 643 section 3.3. Variants were annotated using Annovar 36 . 644

Validation cohort variant calling 645
Sequencing reads were aligned to the reference genome (GRCh37d5) using the Burrows-646 Wheeler aligner (BWA-aln). Unmapped reads, PCR duplicates and reads mapping to 647 regions outside the target regions (merged exonic regions + 10bp either side of each 648 exon) were excluded from analysis. Sequencing depth at each base was assessed using (https://github.com/cancerit/cgpPindel) was applied. We additionally used the 689 aforementioned deepSNV algorithm in order to increase sensitivity for indels present at 690 low VAF. VAF correction was performed using an in-house script 691 (https://github.com/cancerit/vafCorrect ). 692 Post-processing filters required that the following criteria were met for a variant to be 693

Statistical analysis 756
All statistical analyses were performed in the R statistical programming environment. 757 two-sided Wilcoxon rank sum test was used to assign significance level for differences in 758 1) the median number of somatic mutations among the pre-AML and control groups 2) 759 the median VAF of mutations among groups.
3) The age of individuals with spliceosome 760 mutations. Fisher's exact test was used to assign significance to differences in the 761 prevalence 1) of ARCH among the groups. 2) spliceosome mutation in the pre-AML 762 group 763 764 6. Predictive modelling 765

Cox proportional hazards model with random effects 766
We used a Cox proportional hazards regression to model AML progression-free survival 767 as previously described 14 . We used random effects for the Cox proportional hazards 768 model in the CoxHD R package (http://github.com/gerstung-lab/CoxHD). A key strength 769 of this approach is the ability to include many variables in one model while shrinking 770 estimated effects for parameters with weak support in the data, thus controlling for 771 overfitting. We used weighting to minimise the biases introduced by the artificial case-772 control ratio 52,53 and calculated hazard ratios relative to the (approximate) true cumulative 773 incidence of about 1-3/1,000 in the given age range over a follow up of 10-20 years. The 774 observed driver mutation frequency and VAF in pre-AMLs closely resembled values 775 expected based on the estimated risks, indicating that risk model and driver prevalence 776 are well aligned (Extended Data Fig. 4)  cancer-type/leukaemia-aml/incidence. All-cause mortality data was obtained from the 799 office of national statistics (https://www.ons.gov.uk/). 800 801

Ridge regularised logistic regression 802
Using the same covariates as in 6.1.1 we fitted ridge regularised logistic regression model 803 to dichotomised outcome data. While logistic regression is a common choice for case-804 control analyses, a downside of this approach is the inability to explicitly use time-805 dependent covariates. The penalty parameter was chosen using LOOCV on the full 806 cohort; this value was then used on the DC and VC to yield the same scaling of 807 coefficients. Confidence intervals were calculated using 100 bootstrap samples. Fitting 808 was performed using the glmnet R package. AUC as the primary performance metric was 809 calculated using the ROCR R package. 810 811

Regression models 812
Two predictive models were developed. 813 Model 1: This model performs logistic-regression based prediction using four types of 814 features: (1) gender; (2) age at blood sampling; (3) the sum of the VAFs ARCH-PD 815 reported in COSMIC v80 to be recurrent (at least 2 case reports in haematopoietic and 816 lymphoid tissues); and (4) somatic mutation burden of selected genes, where each gene 817 was represented by the sum of the VAFs corresponding to ARCH-PD mutations in that 818 gene. We measured the predictive performance of each gene via the AUC obtained in a 5-819 fold cross validation when using only the gene as a predictive feature, and only retained 820 genes with AUC>55% in the final model. 821

822
Model 2: We applied Lasso regression as implemented in the glmnet R package, while 823 enabling leave-one-out cross-validation to fit a Cox regression model. A minimal subset 824 of ARCH-PD variants was selected whose respective weighted combined VAFs were 825 highly predictive of AML development in the training set. Scores were calculated for 826 each patient as a linear combination of VAF of mutations weighted by regression 827 coefficients that were estimated from the training data. As most scores were zero in the 828 training subset, non-zero scores were discretised to take on a value of 1 that corresponds 829 to AML prediction. 830 Models 1 and 2 were trained on the DC and tested for their association with AML 831 development using the VC data. Survival analysis was performed using the Kaplan-832  Table 4). For each lab measurement, we used median 902 age/gender normalised test values per patient in three time windows ranging from 6-12 903 months prior to onset, 1-2 years prior to onset and 2-3 years prior to onset. In addition, 904 we compute the slope of the normalised lab measurements in the 6 -12 month time 905 window using a linear regression model. 906 Diagnosis features: of the 1780 different major ICD-9 diagnosis codes, we select only 907 diagnosis which were previously observed in at least 10 different cases and have an 908 increased relative risk for AML > 2 fold (as observed on the training set, Supplementary 909 Table 4). For each diagnosis code, we mark whether it appeared in each of the patients in 910 time intervals of 6 months -3 years, and 3-5 years prior to onset. 911 -BMI features: for each patient in the cohort we extracted median BMI, weight and 912 height as measured in time intervals of 6 months to 2 years, and 2-3 years prior to onset. 913 Gradient boosting: We used the R package xgboost to infer parameters for a classifier 914 given cases and controls. Objective was set to binary:logistic, the evaluation metric to 915 AUC. We set nrounds=5000, eta = 0.001, gamma = 0.1, lambda=0.01, alpha=0.01, 916 max_depth=6, min_child_weight=2, subsample=0.7 and colsample_bytree=0.7. The 917 boosting algorithm reports a function f that computes a predictive score given the 918 features. Given a threshold T the expression f(patient features)>T defines a classifier. To 919 standardise thresholds we estimate quantiles for the scores on the train set T(p) = 920 quantile(f(train),p) and define the classifier for specificity level p as f(patient features) > 921 T(p) (Supplementary Table 4). 922 Cross validation and relative risk evaluation. To evaluate the predictive value of 923 classification scheme while considering the strong age and gender biases in the incidence 924 of AML, we performed five-fold cross validation after splitting the cases and controls to 925 five age and gender matched groups. For each fold, we sampled 100,000 controls and 926 combined with the cases, constructed the feature set and trained the model. The model 927 was then tested on the fold cases along with 200,000 sampled controls. We used 928 standardised classifier parameters and standardised thresholds as inferred based on each 929 train set to generate a series of classifications on each test set and merged these based on 930 the control quantiles in the test as described above. Given a threshold p to define high 931 and low prediction score, we counted for each bin b that defines a patient in a specific age 932 (<40, 40-50, 50-60,60-70,70-80,>80)

942
The relative risk in the bin is defined by rr b = r b abs,high/ r b abs , where the sensitivity level for 943 the classifier threshold level is defined as sense b = N b