Final Degree Project Biomedical Engineernig Degree “Exploring machine learning approaches for phenotype prediction of Huntington's disease’’ Barcelona, 5th of May, 2024 Author: Caterina Fuses i Kuzmina Director: Jordi Abante Llenas Tutor: Josep Maria Canals Coll Abstract Huntington’s disease onset of symptoms is clinically predicted primarily using the length of the CAG trinucleotide expansion in the HTT gene. However, this prediction can only explain around 50% of the variability of the phenotype. It is estimated that 40% of the remaining variability is heritable, suggesting the presence of other genetic factors. Genome Wide Association Studies (GWAS) have identified potential genetic modifiers, although only through the revelation of linear effects and via computationally demanding processes. This project benchmarks various machine learning algorithms trained with an Enroll- HD GWAS dataset to predict the age at HD onset. The dataset comprises the geno- type of millions of SNPs from approximately 9,000 individuals. The models considered include regularized linear models (Lasso and Elastic Net) and tree-based models (Ran- dom Forest and XGBoost), and their predictive power is compared to an Ordinary Least Squares baseline model trained solely with sex and CAG as covariates. The re- sults indicate that tree-based models achieve the best estimation of age of onset (AO), improving the prediction by 3% with respect to the baseline, possibly due to their im- plicit consideration of interactions between SNPs. For each model, we extract the most significant features contributing to the model, thereby identifying genetic modifiers. Some of these key SNPs are in well-known AO modifier candidates such as FAN1 and MYT1L, while others are in genes like CDYL2 proposed as new candidates. Keywords: Machine Learning, High Dimensional Data, Huntington’s Disease, Single Nucleotide Polymorphism. i Acknowledgements The most important person to acknowledge in this project is my director, Jordi Abante. The project originated from his innovative idea for a novel research path to develop a better AO prediction algorithm. I am deeply honored to have contributed to the early stages of this endeavor with this initial exploration. My utmost gratitude goes to him for his constant guidance and indispensable insights at every step. I have enjoyed working with him enormously and am very grateful to have completed this project under his supervision. ii List of Figures 1 Inverse correlation of age of onset and CAG repeat length observed in the Enroll-HD dataset used in this project. . . . . . . . . . . . . . . . . 1 2 Cumulative probability of onset of HD for various CAG lengths based on the Langbehn model. Figure by Langbehn et al. [30]. . . . . . . . . 6 3 Mean as estimated by various published formulae. Figure by Langbehn, Hayden, Paulsen et al. [31]. . . . . . . . . . . . . . . . . . . . . . . . . 7 4 Project’s general workflow. . . . . . . . . . . . . . . . . . . . . . . . . . 10 5 Data preprocessing flow chart. . . . . . . . . . . . . . . . . . . . . . . . 19 6 Graphical normality tests. . . . . . . . . . . . . . . . . . . . . . . . . . 25 7 Comparison of least squares method prediction (Lasso), top image, and tree-based method prediction (Histogram XGBoost), bottom image. . . 33 8 Manhattan plots of a least squares method (Lasso), top image, and a tree-based method (Histogram XGBoost), bottom image. . . . . . . . . 34 9 AO over CAG repeat length of 500 random samples, colored by the pres- ence of an alternative allele in SNPs rs61997076 (left) and rs144287831 (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 10 AO over CAG repeat length of 500 random samples, colored by the pres- ence of an alternative allele in SNPs rs118089305 (left) and rs10169129 (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 11 AO over CAG repeat length of 500 random samples, colored by the presence of an alternative allele in rs141338757. . . . . . . . . . . . . . 37 12 GWAS by Lee et al. [8], showing significant associations between AO and different chromosome loci. . . . . . . . . . . . . . . . . . . . . . . . 37 13 GO terms proportions of the features used in each model plus the back- ground proportions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 14 WBS diagram of the project. . . . . . . . . . . . . . . . . . . . . . . . . 39 15 PERT diagram of the project. . . . . . . . . . . . . . . . . . . . . . . . 43 16 GANTT diagram of the project schedule, starting the 29th of January and ending the 28th of May. . . . . . . . . . . . . . . . . . . . . . . . . 44 17 SWOT analysis of the project. . . . . . . . . . . . . . . . . . . . . . . . 45 A.1 Age distribution in training and testing sets. . . . . . . . . . . . . . . . 57 A.2 Number of samples in training and testing sets separated by sex. . . . . 57 A.3 CAG repeat length distribution in training and testing sets separated by sex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 B.1 Least squares methods predictions over training set. . . . . . . . . . . . 59 B.2 XGBoost predictions over training set. . . . . . . . . . . . . . . . . . . 59 B.3 Random Forest predictions over training set. . . . . . . . . . . . . . . . 60 B.4 Elastic Net predictions of testing set. . . . . . . . . . . . . . . . . . . . 61 B.5 Random Forest predictions of testing set. . . . . . . . . . . . . . . . . . 61 B.6 Approx XGBoost predictions of testing set. . . . . . . . . . . . . . . . . 62 C.1 Manhattan Plot of Elastic Net coefficients. . . . . . . . . . . . . . . . . 63 C.2 Manhattan Plot of Random Forest. . . . . . . . . . . . . . . . . . . . . 64 C.3 Manhattan Plot of Approx XGBoost. . . . . . . . . . . . . . . . . . . . 64 iii List of Tables 1 Classification of CAG expansion alleles in HTT [16]. . . . . . . . . . . 4 2 Included GO terms as core genes. . . . . . . . . . . . . . . . . . . . . . 20 3 Grid search parameter values for trained models. All parameters not stated here were left with the default value. . . . . . . . . . . . . . . . 26 4 Hyperparameter values of the final models found using cross-validation. 31 5 Model metrics, execution time and number of features used, including the baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 6 Activity table for the PERT diagram. . . . . . . . . . . . . . . . . . . . 43 7 Lasso top 30 SNPs by feature importance. . . . . . . . . . . . . . . . . 65 8 Elastic Net top 30 SNPs by feature importance. . . . . . . . . . . . . . 66 9 Random Forest top 30 SNPs by feature importance. . . . . . . . . . . . 67 10 Approx XGBoost SNPs by feature importance. . . . . . . . . . . . . . . 68 11 Hist XGBoost SNPs by feature importance. . . . . . . . . . . . . . . . 69 12 GO enrichment analysis results: odds ratio for each comparison against the background counts obtained with a Fisher test. . . . . . . . . . . . 70 13 GO enrichment analysis results: p-values for each comparison against the background counts obtained with a Fisher test. . . . . . . . . . . . 70 14 Counts of GO terms amongst each model features, including the counts in the training feature matrix (background). Table used in the GO en- richment analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 iv List of Abbreviations HD Huntington’s Disease AO Age of Onset GeM Genetic Modifier ML Machine Learning SNP Single Nucleotide Polymorphism MSN Medium-sized spiny neuron GWAS Genome-Wide Association Studies GeM-HD Genetic Modifiers of HD LD Linkage Disequilibrium MRI Magnetic Resonance Imaging RSS Residual Sum of Squares GLM Generalized Linear Model CLT Central Limit Theorem OLS Ordinary Least Squares MAE Mean Absolute Error MSE Mean Squared Error Rsid Reference SNP cluster ID CSR Compressed Sparse Row WBS Work Breakdown Structure PERT Program Evaluation and Review Technique GDPR General Data Protection Regulation SaMD Software as a Medical Device v Contents Abstract i Acknowledgements ii List of figures iii List of tables iv List of abbreviations v 1 Introduction 1 1.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Background 3 2.1 Huntington’s disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 SNP genotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.1 Single nucleotide polymorphisms . . . . . . . . . . . . . . . . . 5 2.2.2 Genome-wide association studies . . . . . . . . . . . . . . . . . 5 2.2.3 Linkage disequilibrium . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 State of the art: Phenotype prediction in HD. . . . . . . . . . . . . . . 6 2.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5 Market analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5.1 ML in disease phenotype prediction . . . . . . . . . . . . . . . . 8 2.5.2 Enroll-HD studies . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Concept engineering 10 3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1.1 Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1.2 Feature scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.3 Outcome scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 ML models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.1 Regression models . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.2 ML packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.3 Data sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.4 Model metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4 Detailed engineering 19 4.1 Core gene selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.1 Core genes SNPs . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2.1 Feature matrix assembly . . . . . . . . . . . . . . . . . . . . . . 22 4.2.2 Alternative variant prevalence filtering . . . . . . . . . . . . . . 23 4.2.3 Toy example matrix . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.4 AO normality check . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.3 Regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.3.1 Feature matrix loading . . . . . . . . . . . . . . . . . . . . . . . 27 4.3.2 Scaling functions . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.3.3 Evaluating functions . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 Results presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5 Results 31 5.1 Baseline model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 SNP models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2.1 Metrics comparison . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2.2 Feature importance . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.2.3 GO enrichment analysis . . . . . . . . . . . . . . . . . . . . . . 37 6 Execution schedule 39 6.1 Work Breakdown Structure . . . . . . . . . . . . . . . . . . . . . . . . 39 6.2 PERT diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.3 GANTT diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 7 Technical viability 45 8 Economic viability 47 9 Regulatory and legal aspects 48 9.1 Enroll-HD data usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 9.2 Medical algorithms legislation . . . . . . . . . . . . . . . . . . . . . . . 48 10 Conclusions and future outlook 49 11 Bibliography 51 A Appendix A 57 B Appendix B 59 B.1 Training predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 B.2 Testing predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 C Appendix C 63 C.1 Manhattan Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 C.2 Feature importance tables . . . . . . . . . . . . . . . . . . . . . . . . . 65 D Appendix D 70 vii 1 Introduction Huntington’s disease (HD) is a hereditary neurodegenerative disease whose first symp- toms can appear at different points of a lifetime. The age of disease onset correlates strongly with the length of the mutation related to the disease, a CAG trinucleotide expansion in the huntingtin gene (HTT ) [1], [2]. Genetic testing and clinical prediction for age of onset (AO) both rely on the length of this expansion. Nevertheless, this is not a perfect predictor, as the standard deviation of AO at a specific CAG repeat length is quite large, specially for short expansions (see Figure 1). Expansion length accounts for 40-70% of the variability of AO. The remaining variance is considered to be due to environmental and other genetic factors, showing a high degree of heritability [3], [4]. Large genetic studies have been done during the last two decades and are still ongoing in the search for genetic modifiers (GeMs) of AO, genes or genetic elements involved in the process of disease onset either by accelerating or delaying the emergence of HD’s first symptoms [5]. The main AO modifier candidates are related to DNA maintenance machinery, through their mechanisms are still being studied [6]. Figure 1: Inverse correlation of age of onset and CAG repeat length observed in the Enroll-HD dataset used in this project. Phenotype prediction in HD is not only interesting from a clinical point of view for life planning addressed to mutation carriers. It can also provide valuable insight of the disease mechanisms that generate such phenotype, and apply this knowledge into the design of clinical trials, where the effect of putative modifiers of HD pathogenesis needs to be controlled for the effect of genetic background [7]. Knowing more about disease onset mechanisms can also reveal possible targets for treatment that could delay onset of symptoms [2]. 1 Exploring ML approaches for phenotype prediction of HD 1.1 Objective The main research objectives of the thesis, by order of relevance, are: 1. Benchmark different machine learning (ML) regression models on how well can they predict AO in HD using SNP genotyping data. 2. Identify which genetic variants are most important in predicting AO and which processes are they involved in. 1.2 Methodology The workflow of the project can be divided into four main parts, which are not strictly sequential. First, a comprehensive literature review is conducted to understand the background and context in which the project is situated. This is followed by data preprocessing, where the raw data is tailored and prepared for model training. Subse- quently, ML models are selected and trained to analyze the preprocessed data. Finally, the results are discussed, offering insights into the performance of the models and their implications for the study. Models are going to be trained with data provided by Enroll-HD, genetic data contain- ing the genotype of millions of Single Nucleotide Polymorphisms (SNP) for thousands of HD patients. The dataset was assembled by Lee et al. [8] in June 2023. Data handling will be done with different programming languages, depending on the specific needs (e.g., Python, R, bash, C++). The models themselves will be coded and trained using Python and several open-source packages for easy implementation of ML algorithms. 1.3 Scope This project is an exploratory approach to phenotype prediction. The use of (ML) is not intended to create a new prediction tool but to review how well existing models can predict the onset of the disease based solely on genetic data. The results obtained with the final trained models will be compared to published studies that propose candidates for HD GeMs. This comparison will serve as a check for positive control discoveries and provide an alternative method for GeM discovery. 1.4 Limitations The limitations of the project stem from several challenges across its various stages. First, the vast amount of data requires significant computational resources for pre- processing and analysis, which might not be readily available. Secondly, the choice of feature engineering methods, such as treating each SNP individually or aggregat- ing them at the gene level, presents a trade-off between specificity and dimensionality reduction, potentially masking important polymorphisms when reducing dimensions. Additionally, selecting the right subset of SNPs impacts the discovery potential and computational feasibility. The high dimensionality of data also impacts the training of ML models, which can led to models which suffer from high variance. 2 2 Background 2.1 Huntington’s disease HD is an autosomal dominant neurodegenerative disease. At disease onset an irre- versible progression of motor, psychiatric and cognitive symptoms starts. As there is currently no cure, this progression inevitably spans over 15-20 years, ending with death. In most cases the first manifestations of HD appear in mid-life. Prior to this on- set, called “manifest HD”, a “prodromal” phase may happen during some years where more subtle symptoms start appearing [9]. During this time, the striatum suffers a progressive a neuronal loss, finally leading to the development of manifest HD. Clinical symptoms of manifested HD are very heterogeneous, even within families. The most distinct motor symptom is chorea: involuntary, excessive movements whose am- plitude increases with disease progression. At more advanced stages, bradykinesia and rigidity dominate the motor symptoms. Cognitive impairments that appear are related to visual attention, psychomotor speed, and visuomotor and spatial integration, exec- utive capacity, and short-term memory. The most common psychiatric symptoms are depression and anxiety, which can occur in premanifest HD, followed by apathy, irri- tability, and obsessive, compulsive thoughts and behaviours [9]. Pneumonia and suicide are the most common cases of death in HD [10]. HD pathology is cytologically located mainly in the striatum. Medium-sized spiny neu- rons (MSNs) of the striatum which use γ-aminobutyric acid (GABA) die, and cortical pyramidal neurons that project to the striatum and striatal neurons projecting to the substantia nigra degenerate [11]. The normal distribution of glial cells is also affected [12]. The overall brain volume is reduced, different brain compartments showing differ- ent rates of loss [13]. In 1993, the mutation related to this disease was identified in the huntingtin gene (HTT ) located in the short arm of chromosome 4: a CAG trinucleotide expansion in its exon 1, a trinucleotide which is transcribed into glutamine [14]. This finding indicated that genetic testing could be an accurate and specific diagnostic test for HD. The same study revealed that there was a wide range of AO for any specific repeat number, although larger expansions correlated with juvenile onset, extreme rigidity and more widespread neuropathology. Now it is estimated that only around 50-70% of variance of AO can be explained by CAG repeat length. The remainder is accounted for by environmental and individual genetic background [2], [3], [15]. The exon where the expansion mutation is located encodes a polyglutamine tract near the amino terminus, an alpha-helical solenoid-like scaffold. Repeats in this expansion up to 35 CAGs are found in the general population, while length polymorphisms ex- ceeding this number can cause HD, affecting the structure, phosphorylation pattern and activities of the protein [16]. Depending on the length of the expansion, the allele is usually classified into normal, reduced penetrance and full penetrance allele (see Table 1). This mutation is inherited following an autosomal dominant pattern, meaning that if 3 Exploring ML approaches for phenotype prediction of HD Allele classification nº CAG repetitions Expression Normal Allele <27 Not associated with a phenotype and are inherited in a stable manner. High Normal/ Intermediate Allele 27-35 Do not typically develop signs of HD but may show some degree of germline instability of the CAG repeat. Reduced Penetrance HD Allele 36-39 Associated with HD, but not all individuals with these alleles will develop clinical symptoms. Full Penetrance HD Allele >39 Almost always associated with development of clinical signs and symptoms of HD. Table 1: Classification of CAG expansion alleles in HTT [16]. either parent carries the CAG repeat expansion, the child has a 50% chance of inheriting it. It is also an unstable mutation: the length of the expansion can increase when passed from parent to child, being unstable in 80% of intergenerational transmissions. The higher size changes usually occur in the male germline [17]. Due to this instability, parents with CAG allele in the intermediate range which is not causing symptoms might pass a disease-causing range allele to their offspring. Decreases in size seldom occur. Expansion of HTT alleles can happen somatically as well as gametically, being prevalent in brain regions that are most susceptible to neurodegeneration [18]. Somatic expansion is predicted to accelerate the disease process [19], [20]. Backing the dominant behaviour, heterozygotes which have two expanded alleles with different sizes do not seem to follow a different pattern of onset than heterozygous when taking the larger allele for comparison. This means that the effects this mutation has are caused by the longest expanded allele, the dominant one [21]. However, there are some studies indicating that there could be an interaction between the two alleles: among HD subjects with large fully penetrant alleles, the length of the unexpanded CAG repeat was positively associated with a delayed HD onset [4]. Whether having two mutant alleles means displaying a more severe phenotype is still under discussion [2]. Upon further research on what other factors could be involved in HD manifestation and development, HD modifiers are still being discovered and described. Up until this moment, the majority of proven modifiers are DNA maintenance genes. The correla- tion between HD modifiers and the disease’s precise toxicity mechanisms are yet to be discovered [16]. The protein’s activity is altered in other neurological diseases apart from HD [22]. Its function is not completely known, but it is involved in brain development, transcription regulation, balance of histone acetylation and deacetylation, glial activation, mitochon- drial functions, axonal transport, signaling pathways regulatization, and autophagy[2]. The mutant HTT affects the regulation of transcription factors, impacts chromatin regulation, impairs mitochondrial functions, alters protein homeostasis, affects axonal 4 Exploring ML approaches for phenotype prediction of HD trafficking, dysregulates glutamatergic signaling, induces synaptic plasticity failure and glial activation [2], [23]. The mutant version of huntingtin is misfolded and forms ag- gregates with toxic properties, whereas in unaffected individuals it is localized in a diffused fashion around the cell cytoplasm [24]. Polyglutamine tracts with 37 or more glutamines (almost coinciding with the fully penetrant allele) self-assemble into such aggregates in experiments in vitro. The formation of this nucleus is dependant on time and protein concentration, and the rate of aggregate formation directly correlates with repeat length [25]. Mutant HTT also can co-aggregate with other proteins reduc- ing their availability. Both toxic gain-of-function of the mutant HTT and loss of the wildtype HTT functions contribute to the disease phenotype. 2.2 SNP genotyping 2.2.1 Single nucleotide polymorphisms A SNP is a described variation at a single position in a DNA sequence among individ- uals, only named as such if the variation is found in more than 1% of a population [26]. For every SNP there is a reference allele, which is the nucleotide present in the major- ity of the population, and one or multiple alleles, which are the minoritary nucleotides found in that position. 2.2.2 Genome-wide association studies The primary use of Genome-Wide Association Studies (GWAS) is to identify genetic variations associated with complex traits or diseases in populations. GWAS scan the entire genome of individuals to find variations such as SNPs that occur more frequently in people with a particular trait or disease (referred to as phenotype) compared to those without it. This way enhancers and suppressors of selected phenotypes are identified. Given the lack of variance explained in the phenotype by the CAG repeat length, the HD research community started the search for HD modifiers. When GWAS became technologically possible, large human molecular genetic tests enabled an unbiased search for modifier loci (position on a genome). The HDMAPS study by Li et al. (2003) [3] was a large linkage analysis on a genome scan which found evident linkage at chromosomes 4 and 6, near the localization of potential modifiers GRIK2. A combination of three GWAS carried out by the Genetic Modifiers of HD (GeM-HD) Consortium in 2015 identified important loci on chromosomes 8 and 15 that accelerated or delayed onset with respect to the mean [5]. Genes in these loci have been confirmed as HD modifiers years later through biological models: FAN1, in chromosome 15, a gene involved in DNA repair; and RRM2B, in chromosome 8, a ribonucleotide reductase [16]. Further studies revealed new modifier loci in other chromosomes; the most recent one by Lee, MacDonald and Gusella (2022) [8] has identified modifier loci in chromosome 7 too, with a greater power analysis by including motor and cognitive measures as additional phenotypes. The potential of such studies is still being exploited. 5 Exploring ML approaches for phenotype prediction of HD 2.2.3 Linkage disequilibrium An important fact to consider when relating SNPs to a certain phenotype is the linkage disequilibrium (LD), the non-random association of alleles at two or more loci: the allele of one polymorphism in an LD block, also known as haplotype, can predict the allele of the other polymorphisms in the block [27]. The size of the LD blocks depends primarily on the recombination rate in the region where they are found [28]. 2.3 State of the art: Phenotype prediction in HD. HD prediction in presymptomatic patients started for research purposes, as studying the evolution of the disease in already manifested HD was an important research limitation. The discovery of the trinucleotide repeat expansion in 1993 gave the basis for a highly accurate and easy to perform test, without the need of family samples [14]. It then evolved into a clinical service accompanied by counselling and support protocols which were later on internationally standardized [29]. One of the most spread model for age at HD onset is the Langbehn formula, derived from a parametric survival model [30]. This survival model was created with CAG repeat length specific survival functions, that take into consideration not only the CAG length but the current age of the case at hand. The resulting model can be expressed through a formula with parameters age and CAG that computes the probability of surviving without neurological symptoms until at least the given age. For example, a 40 year old individual without symptoms and a CAG repeat size of 44 will have a probability of onset by age 50 years, 10 years into the future, of 0.7; in an individual of this same age but with a repeat size of 44, the probability of onset by age 50 is 0.95. These probabilities are given with a 95% confidence interval for CAG lengths between 41 and 56, as these are the lengths of the samples used to build the model. Predictions for CAG beyond these limits are extrapolations. Figure 2 shows the cumulative probabilities for different CAG lengths of the model. Figure 2: Cumulative probability of onset of HD for various CAG lengths based on the Langbehn model. Figure by Langbehn et al. [30]. 6 Exploring ML approaches for phenotype prediction of HD There are other published statistical models that fit relationships between CAG length and clinical onset. There are two main approaches: using some sort of linear regression, or as the Langbehn model, using modeling techniques particular to survival analysis. Regressions tend to provide overly pessimistic estimates of AO specially for shorter CAG lengths, because they do not considerate cases which never reach HD diagnosis. Survival analysis accounts for samples that do not reach onset [31]. All models struggle predicting onset for the shortest expansions, due to either statistical biases or un- representative sample sets because subjects in this range are mostly non-symptomatic and are rare in clinical samples. Figure 3 overlaps the mean AO predicted for different CAG lengths by different models. Figure 3: Mean as estimated by various published formulae. Figure by Langbehn, Hayden, Paulsen et al. [31]. The PREDICT-HD study was a multi-site observational study ongoing for more than a decade at the start of the century aimed at identifying biological and refined clinical markers of early HD in humans, to validate them for use in preventive clinical tri- als. Outcome measures include basal ganglia volumes on magnetic resonance imaging (MRI), estimated probability of diagnosis based on CAG length, performances on 21 standardized cognitive tasks, total scores on 3 scales of psychiatric distress, and motor diagnosis based on the Unified Huntington’s Disease Rating Scale [32]. Studies done over the data generated by this study try to relate biological and clinical measures to the estimation of years to diagnosis, to better understand how the disease symptoms start manifesting and evolve. Estimated time to diagnosis is correlated to most clinical and neuroimaging markers [33]. Detectable changes were seen 10-20 years before the predicted age of clinical diagnosis. Structural MRI scans alone have been proven suf- ficient to separate presymptomatic disease gene mutation carriers and prodromal from controls [34], [35]. 7 Exploring ML approaches for phenotype prediction of HD 2.4 Machine Learning ML is a branch of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instruc- tions. It is considered a knowledge discovery technique, together with statistics, as it is an approach to characterize and predict complex phenomena described by a set of variables [36]. In the context of the study of biological systems, the two major goals pursued are in- ference and prediction. Inference refers to the capability of creating a mathematical model of the process creating the studied data in order to understand such process or test a hypothesis about its behaviour. Prediction is focused in forecasting unob- served outcomes, without necessarily understanding the underlying mechanisms of the involved process. Both statistics and ML are methods that can achieve both goals, but classical statistical methods are historically more focused on inference, and ML meth- ods concentrate on prediction [37]. Statistics requires previous knowledge about the data-generating system, as different methods take different assumptions [38]. This is not the case of ML, which makes minimal assumptions about the systems. This proves particularly helpful when exploring unknown systems with interactions that can be complicated and nonlinear [39]. Another advantage with respect to classical statistics is the size of the data ML can deal with, allowing a larger number of input variables (features) than the number of samples. ML can be broadly categorized into two types: supervised learning which uses pre- labeled samples to allow the algorithm to see how accurate its performance is; and unsupervised learning, where the algorithm can identify patterns amongst unlabeled data. 2.5 Market analysis HD is a disease present in all populations but occurs at much higher frequencies among individuals of European ancestry. Prevalence studies show that approximately 1 in 7,300 individuals are affected in Western populations [15]. The demand of HD testing and phenotype prediction is overall considered to be between 5 and 20% amongst at- risk patients. Data on subjects tested from different centres show a higher percentage of female subjects being tested than men [40]. Reasons to undergo genetic testing generally include future planning in career and family decisions. Because there does not exist an effective treatment of the disease, genetic testing is still low. The uncertainty of predictions for short CAG expansions also contributes in a low testing percentage. With a better model, genetic counselling would be more reliable. A better AO regression model could also be a potential research tool in HD research, to understand the mechanisms of disease onset and reveal possible targets for treatment [2]. 2.5.1 ML in disease phenotype prediction The most common type of studies to reveal genotype-phenotype associations are the already introduced GWAS, centered in identifying statistically significant associations. 8 Exploring ML approaches for phenotype prediction of HD Using ML algorithms to build models that correlate genotype with phenotype shifts the goal from deeply understanding the phenotype (main GWAS goal) towards accurately predicting it [41]. Still, in a disease diagnostics setting, the explainability of a prediction is crucial, and thus algorithms that produce interpretable models are preferred. There are two main challenges when applying ML onto the study of the effects of multi- ple genetic variables: the size of the datasets that are usually used (with more variables than samples due to the extension of the genome) and the effect of LD between SNPs, which is most likely to be present when a large number of SNPs are genotyped on a genome-wide scale, and which makes the dataset have correlated variables. This is why either penalized regression methods applied onto standard linear regressions or non-parametric approaches like the ensemble methods are needed to model the com- plex relationships contained in the data even though it probably contains correlated variables [42]. With this consideration, many researchers have shown the potential of SNP information for phenotype prediction [42], [43]. 2.5.2 Enroll-HD studies The study of rare diseases with ML has an extra challenge involved due to the ML requirement for large datasets. Enroll-HD is an outstanding observational study with almost 25,000 participants, being the world largest study for HD [44]. This dimensions make studying HD with ML possible. The dataset provides a very complete dataset with very diverse information about HD patients: general information such as sociode- mographic data; medical information like medication and nutrition; motor, functional, behavioural and cognitive assessments, ratings and tests; and genotyping information from blood samples. There are several published studies which use diverse ML techniques trained with Enroll-HD datasets. Ouwerkerk et al. (2023) [45] explores an Enroll-HD dataset of over 300 categorical and numerical features applied into improving AO prognosis, achieving an improvement of 9.2% with respect to the Langbehn formula estimation by using a light gradient boosting machine; and being able to predict future time points of disease progression. Ko et al. (2023) [46] recently applied ML into modeling disease progres- sion with a similar heterogeneous dataset, using an unsupervised method to cluster samples according to progression velocity, followed by a supervised method (XGBoost) to identify features which could predict disease trajectory. The feature corresponding to the product of age and CAG repeat length was the top predicting feature, followed by years since symptom onset and medical history of apathy. No published study has been found using ML with the GWAS dataset from Enroll-HD. The specific dataset used in this project was assembled by the GeM-HD Consortium [47] and later on expanded by Lee et al. [8], and used in GWAS analysis which revealed possible AO modifying SNPs. 9 3 Concept engineering The main objective that summarizes the driving purpose of this thesis is to predict the AO in HD from SNP information. There are several approaches regarding different stages of the project, mainly: i) how to treat the available data (approximately 40GB of compressed raw data), and ii) what algorithms should be tested and how to compare them. The project’s workflow (Figure 4) highlights the steps where we could uptake different methodologies, which will be discussed in this section. Figure 4: Project’s general workflow. 3.1 Data Preprocessing 3.1.1 Feature engineering The most usual approach to encode SNPs as numerical features consists in taking each SNP as an individual feature, with values 0, 1 or 2 depending on the allele of each sample (heterozygous or homozygous with the reference or alternative allele). This is a more suitable data format for further ML steps, as the raw data is encoded by 0/0, 0/1 and 1/1, which is not an appropriate input format. We could also define as feature the sum of the SNP encodings in each gene, resulting in one feature per gene and, consequently, reducing drastically the dimensionality. We could even add only the encoded value of SNPs that have an effect on the gene’s corresponding protein’s structure, those found in coding regions. Considering that taking all SNPs in our data pertaining to a single gene we can find up to 2000 SNPs/gene and in average being 204 SNPs/gene, reducing these numbers to a single dimension would extremely reduce the number of features given to the models, making them easier to train. Nevertheless, this places the focus of the conclusions on a gene level rather than the SNP level, which would reduce the specificity of the results. Moreover, masking of important events in large genes could make important polymorphisms invisible. Still, the main interest of the thesis is to work at SNP level, which will allow us to identify relevant variants and model epistatic interactions (interactions between genes to express a certain phenotype). Using all available SNPs directly would be a first exploratory analysis similar to a large Genome-Wide Association Study (GWAS), but it would require an unfeasible amount of computational power to train models with millions of features. Instead, we need to establish which is the best sub-setting criterion. 10 Exploring ML approaches for phenotype prediction of HD a. Literature-Based SNP Selection. One option could be taking the SNPs that have been previously described in literature to explain AO variance. This way we would be working with a low number of features, but the potential of revealing new AO modifiers would be zero. The next logical step would be to take the SNPs pertaining to genes described as either HD modifiers or related to AO explanation. Now we could find what SNPs inside each gene are the ones that could be responsible for the already observed gene effect. The discovery potential of this option is also limited. b. Ontology-Based Selection. A broader option is to take the genes of ontology terms related to HD processes, after a thorough literature review. This would include genes which have been previously studied in this context and genes related to them, which have not been studied, but yet they are not randomly selected. The set of genes of each term can be extracted as files from the AmiGO web application [48]. c. Protein Interaction Network-Based Selection. Another parallel approach to broaden the subset of selected SNPs is to include the genes related to a core set of genes on the protein level, taken from the STRING database [49]. In this case another problem arises: how many levels of interaction, which we named interaction order, should we include? Should we just take those genes directly related to the core set of genes (first order interactions), or should we expand it further to those related to this first order interactions, related to the core genes through a two link path (second order interaction), and so on? We would have to test how such a network would grow including interactions of increasing order. This interactions are extracted from databases where a confidence interval for each interaction is given, representing how confident is the community in that interaction’s existence. Different thresholds of this confidence should also be considered when exploring this option. d. Alternative Allele Prevalence Selection. An additional option for dimen- sionality reduction whilst treating single SNPs as features is to filter them by prevalence of the alternative alleles, by removing those SNPs that do not reach a minimum threshold of the alternative allele in our sample set. This would follow the reasoning behind treating a single nucleotide variation as a SNP. 3.1.2 Feature scaling Some ML models are more easily interpretable if the data given as input has a specific distribution, or they require a specific feature value range. As a general rule, models perform better if all the used features have values in comparable ranges. This is why feature scaling is an important step to consider. In out data we include the sex and CAG repeat length of each sample as two additional features apart from the selected SNPs, as we have to control for the features that most highly explain AO variance and this is most probably the CAG repeat lenght. Sex is encoded with 1 (male) or 2 (female), which are values similar to SNP encoding (0, 1 or 2). CAG length, on the other hand, is much larger, containing values between 40 and 11 Exploring ML approaches for phenotype prediction of HD 55. This feature must be scaled down. The most common scaling methods for ML are: a. Min Max Scaler: preserves the original distribution of data but scales it linearly to be in the range [0, 1]. y′ = y − ymin ymax − ymin (1) b. Standard Scaler: transforms data to have mean (µ) equal to 0 and unit standard deviation (σ). y′ = y − µ σ (2) 3.1.3 Outcome scaling AO is also spanning between 5 and 83. Depending on what algorithm we will be using, we might want to scale it too in order to make the residuals centered at 0. The scaling that make this transforms data to be Gaussian, so we will use the standard scaler on the outcome vector. 3.2 ML models Being a continuous random variable, AO prediction can be thought as a regression problem using a supervised learning process: we use the known outcome (AO) to build a regressor. As in any ML application, having more samples than features is an im- portant limitation, as any model build with such high-dimensional data are prone to overfitting, giving a model that could not predict new unseen observations successfully. Penalized regression models are a recorded necessity in settings similar to ours [42], [50] to constrain the model to represent the overall relations in the dataset rather than each sample’s particularities. 3.2.1 Regression models Regression models that are used for similar applications to ours (regression over SNPs data for phenotype) comprise linear least squares methods and tree-based methods [51]. 1. Linear regression methods These methods include all models that can ultimately be expressed as a fit of the response Y to the data X through a vector of coefficients β and a disturbance term ϵ: Y ≈ βX + ϵ (3) 12 Exploring ML approaches for phenotype prediction of HD A multiple linear regression model estimates as many coefficients as features used as predictors from the data. After optimizing the unknown parameters of the model (β vector), these can be used to predict the outcome of new observations, being the new observations x and their prediction yˆ. The most common way of estimating the coefficients is the least squares method: it looks for coefficient values that minimize the error between the true outcome and the predicted out- come, which is expressed by the residual sum of squares (RSS), being a residual the difference between the true outcome and the predicted outcome. In Eq. 4, n is the number of samples used for training the model, and p the number of predictors. RSS = n∑ i=1 e2i = n∑ i=1 (yi − yˆi)2 (4) This method is very straightforward, but fitting a linear regression onto any type of data is not guaranteed to yield good results. Problems as multicollinearity, overfitting and non-linearity of the response-predictor relationships pose the need for complementary methods to model a broader set of cases. Non-linearity can be addressed by using polynomial regression, interaction terms and transformations, amongst other options. They all involve fitting the data to functions which are no longer linear. Generalized Linear Models (GLMs) extend linear regression to handle non-normal response variables like counts of discrete while allowing for different types of response distributions and link functions. This would be an interesting approach to apply in our case, as AO can be thought as counts coming from a Poisson distribution, and we could use a Poisson regres- sion, a specific type of GLM [52]. But our AO vector contains decimal onset ages, and therefore we should round its values to integers, which would involve loosing information. The possible non- normality coming from the theoretic distribution of our response variable can be ignored considering the approximation given by the Central Limit Theorem (CLT) which implies that for a sufficiently large sample size, the distribution of the sample mean of random variables will be approximately normal [53]. This normality approximation should be tested empirically (with statistical tests such as Shapiro-Wilks or Kolmogorov–Smirnov) as part of the preprocessing steps, as SNP counts (considering the counts as the genotype) are not entirely independent in the case of LD blocks. Such tests have to be computed over subsets of the outcome vector, as any test will find enough evidence of a non-normal distribution with such large statistical power provided with our data (over 9k samples). We can also test for normality graphically, by fitting the histogram of AO to a normal distribution and assessing the fit, or through a Q-Q plot. If we can prove that the outcome variable can be approximated to a Gaussian distribution, we can use the least squares approach. We can build a simple model using Ordinary Least Squares (OLS) to work as the baseline, using as predictors only CAG repeat length and sex. Then we can build 13 Exploring ML approaches for phenotype prediction of HD more complex models including the SNP genotypes. Our main interest resides in shrinking methods which focus on regularizing the coefficient estimates, making them close or equal to 0. Such methods • Act as feature selectors, which is important to highlight feature impor- tance in a model towards finding AO modifiers. • Control overfitting by introducing a controlled amount of bias which re- duces the variance of the model, thereby encouraging the model to generalize well to unseen data. • Deal with multicollinearity, the situation where features are correlated with each other and thus either provide the same information or their effect can be canceled if the correlated features have opposite effects (a large posi- tive coefficient can be canceled by its related large negative coefficient) [54]. This could happen in our context when having SNPs of the same LD block. The three main shrinkage methods are the following: a. Ridge This technique uses the l2 regularization: it introduces a penalty term in the function to minimize (Eq. 5), which is the product of the tuning parameter l2 to the norm two of the coefficients’ vector. F (β) = RSS+α p∑ j=1 β2j (5) The effect of this added penalty is that it shrinks closer to 0 the largest coefficients, and acts less in those which are already close to 0. A higher α value decreases variance but increases bias. An optimal parameter value has to be found to reach a nice equilibrium between bias and variance and have a good associated model performance. b. Lasso This approach also includes a penalty term to the minimized function, but in this case it uses the norm 1 of the coefficients’ vector, F (β) = RSS+α p∑ j=1 |βj| (6) Such penalty acts on the model by performing variable selection when α is sufficiently large, shrinking to exactly 0 those features which contribute less in the model. It is indifferent to very correlated predictors, it will tend to pick one and ignore the rest. 14 Exploring ML approaches for phenotype prediction of HD c. Elastic Net Elastic Net can be considered a fusion of both Ridge and Lasso, as it uses l2 and l1 regularization at the same time. The minimized function (Eq. 7), when implemented, considers α2 = 1− α1. F (β) = RSS+α1 p∑ j=1 |βj|+ α21 2 p∑ j=1 β2j (7) This penalty is particularly useful in situations where p >> N or where there are many correlated predictor variables. For a penalty close to the Lasso, it can perform just like a Lasso while removing strange behaviours caused by extreme correlations [55]. As we will most probably be working with a very big dataset where the number of features is going to be very large, the most useful regularization to implement is an l1 regularization for feature selection, and thus Lasso is the principal model to try out. Elastic Net is also a good option. 2. Tree-based methods These methods involve segmenting the predictor space into a number of simple regions: the prediction of a new observation is set to the mean or mode response value for the training observations similar to the new input. The segmentation of the predictor space follows a set of splitting rules, which can be thought of as a tree, hence the method’s name. Each split, called node, generates two leafs, creating its own unique simple regression model [56]. The used approach to segment the predictor space (the space made up of all features used in the model) is known as recursive binary splitting, which is: top-down in the sense that it begins splitting by just one feature, at the top of the tree where all observations are still in the same region; and greedy because each split step is chosen considering the immediate goodness of the split, without thinking of possible better combinations with other splits further down. The goodness of a split is assessed by minimizing the RSS over all regions (Eq. 8) where Rj are the predictor regions (having a total of J regions), and yˆRj the mean response for the training observations within the jth region. RSS = J∑ j=1 ∑ i∈Rj ( yi − yˆRj )2 (8) The main benefit of tree-based methods is their interpretability. But a single decision tree may not always have a good prediction accuracy due to its simplicity. It has a very high variance, and it tends to overfit, as feature space splitting usually ends when the split space contains a small number of train observations, which can make the tree too case specific. To solve this, a threshold on RSS could be applied, but this could leave out low minimizations of RSS that later on 15 Exploring ML approaches for phenotype prediction of HD would reveal important splits. This overfit is solved with cost complexity pruning or weakest link pruning : a tuning parameter α controls a trade-off between the subtree’s complexity (amount of nodes) and the fit to the training data. In the minimized equation optimized in this approach (Eq. 9), |T | is the number of end nodes of the tree T, Rm is the predictor subset of each terminal or end node, and yˆRm is the predicted response associated with Rm. A more complex tree will have a higher number of terminal nodes, which will make the function’s independent term value, the penalization, larger. F (|T |) = |T |∑ m=1 ∑ i∈Rm (yi − yˆRm)2 + α|T | (9) Tree ensemble methods are more complex approaches which produce multi- ple weak trees or learners and later combine them to produce a single output prediction. These are: a. Bagging: trees are grown independently using a random set of samples and all features being available for every tree growth. b. Random Forests: trees are also grown using random sets of samples, but for each split on each tree only a random subset of features is available. A similar approach is followed by Extra Trees, with the difference that it chooses the split points at each node randomly, whereas Random Forests searches for the best splitting points for each random subset of features at each node. This makes Extra Trees faster to train, with the price of having higher bias. c. Boosting: trees are grown successively, starting from a single tree which uses all samples and all features, and continuing on fitting the following tree to the residuals of the previous one. It is a numerical optimization technique for minimizing the loss function by adding a new tree at each step that best reduces the loss function (stepping down its gradient) [57]. Random forests are preferable over bagging in our case as we restrain the used features as predictors to a random subset of features. This way if there is a very strong predictor as CAG is, we can create trees that will explore relations between other features apart from CAG. We can try Extra Trees if the implementation of Random Forest is computationally unfeasible. Both bagging and random forests make predictions taking the average of all their regression trees which are independent. This is why increasing the number of trees can prevent overfitting. Boosting grows trees by capturing signals which have not been accounted for by the previous set of trees, so a large number of trees can overfit data. The size of trees in boosting methods is also reduced compared to bagging and random forests [58]. Inside boosting methods, there are different tree-methods: the exact solution and 16 Exploring ML approaches for phenotype prediction of HD the approximated solutions. The exact solution involves the booster considering all candidates from data for tree splitting, which makes this method very slow computationally [59]. Approximated solutions build a gradient histogram for each node and iterate through the histogram instead of seeing all observations of the real dataset [60]. 3.2.2 ML packages The most commonly used resource for easy ML algorithm training is Scikit-learn for Python. Apart from containing functions for several algorithms, it also contains mod- ules on metrics to evaluate the fit of a model, and model selection to perform cross validation and grid search to find the best parameters. Similar packages are statsmod- els and skglm. For boosting algorithms, the most popular package is XGBoost. As a starting point, we will implement the chosen models with the two most simple sources: Scikit-learn [61] and XGBoost [59]. 3.2.3 Data sparsity Another important aspect to keep in mind when choosing algorithms is the large di- mension of our data. Training any model with a feature matrix taking several gigabytes of memory can turn out to be impossible with the available resources. Considering the matrix is mostly formed by zeros, as the reference homozygous is the average genotype we will find in most cases, we can use sparse matrices. This is a way of representing a matrix by listing the non-zero elements of the matrix and their position. This way we store less values than by recording all the positions that simply contain zeros. Most ML training packages in python, including Scikit-learn and XGBoost allow giving as input a sparse matrix. 3.2.4 Model metrics Because we are training models with high-dimensional data where the number of fea- tures is larger than the number of samples, reporting measures of model fit using sta- tistical metrics on the training data will not be very informative, as the models will most probably be overfitted. Instead, these metrics should be computed over a test set to compare the true predictive ability of the models. Calculating these metrics over the train set is also done, but to check for overfitting rather than for model comparison. There are different statistics to evaluate the goodness of fit of a model [62]. In the following expressions, y represents the outcome vector true values, and yˆ their corre- sponding predictions. a. Max error: single worst error amongst all predictions. b. Mean absolute error (MAE): average of the absolute difference between the true and predicted values of y. 17 Exploring ML approaches for phenotype prediction of HD MAE(y, yˆ) = 1 nsamples nsamples −1∑ i=0 |yi − yˆi| (10) c. Mean squared error (MSE): expected value of the squared error. MSE(y, yˆ) = 1 nsamples nsamples −1∑ i=0 (yi − yˆi)2 (11) d. Coefficient of determination R2: proportion of variance of y explained by the independent variables in the model. R2(y, yˆ) = 1− ∑n i=1 (yi − yˆi)2∑n i=1 (yi − y¯)2 (12) y¯ is the true average 1 n ∑n i=1 yi and ∑n i=1 (yi − yˆi)2, which is the RSS. The closer the score is to 1, the better the fit to the data. A 0 value of R2 means the model is constant, always predicting the expected value of y disregarding the input features (yˆi = y¯ for i = 1, ..., n). It can also be negative, if the model is arbitrarily worse than the constant model. e. Explained variance score: fraction of deviance explained. A perfect fit explains the totality of the data variance, hence the score would be 1. Lower values correspond to worse fits. explained variance(y, yˆ) = 1− Var{y − yˆ} Var{y} (13) It is similar to R2, but the explained variance score does not account for systematic offset in the prediction. There are many other metrics we could use, but these are the simpler ones which should be enough to compare between models. These methods tend to be very dataset dependent, thus the test set to evaluate all models will be the same. The preferred metric in many different settings is the coefficient of determination. The model’s performance can also be assessed visually by plotting the predicted values over the actual values of all samples in the data set. Another useful plot is the repre- sentation of the residuals (predictions subtracted from the true values) over the actual values, to see whether a range of outcomes is better or worse predicted than the rest. 18 4 Detailed engineering All generated code can be consulted in the ML-HD GitHub repository (URL: https: //github.com/cfuses/ML-HD). The repository organization follows the same direc- tory structure as the code directory of the computer where the scripts were executed. Throughout this section some code snippets are included between the text to show the parameters used in relevant functions or to see in a more direct way how certain non-standard processes were executed. Scripts were build and executed in a remote machine provided by Creatio. To manage the required packages, a Conda environment was created with the necessary packages: Plink2, Numpy, Pandas, Scipy, Sklearn, Matplotlib, os, XGBoost, and StatsModels. The “GWAS12345” Enroll-HD data was acquired from CHDI in .tar format. The un- compressed folder contained .bed, .bim, and .fam files for each autosomal chromosome plus a metadata file containing sex, CAG repeat length and different residual calcula- tions of each subject, as well as a readme file with information about the compression format. The total number of subjects is 9064, with 4,647 females and 4,417 males. CAG values range from 44 to 55, with a mean of 44 and a standard deviation of 3.06, which means we will only be working with full penetrance HD alleles. The mean AO is 45.54 years, with a standard deviation of 11.58 years. The outline of the preprocessing steps taken is represented in Figure 5. Each process is explained in detail in the following subsections. Figure 5: Data preprocessing flow chart. 4.1 Core gene selection We considered that for an application with simple ML models, the ontology-based selection of genes would be enough. After a thorough literature review on genes and 19 Exploring ML approaches for phenotype prediction of HD processes related to HD mechanisms [2], [5], [7], [8], [15], [19], [63]–[66], a list of GO terms to be included in the analysis was assembled, presented in Table 2. Process GO term Related Genes Mismatch repair GO:0006298 FAN1, MLH1, MSH3, MLH3 Synaptic transmission, Glutamatergic GO:0035249 GRIK2, GRIN2A, GRIN2B Omega peptidase activity GO:0008242 Cysteine-type endopeptidase activity GO:0004197 Proteasome-mediated ubiquitin-dependent protein catabolic process GO:0043161 Ubiquitin binding GO:0043130 Ubiquitin protein ligase binding GO:0031625 Protein deubiquitination GO:0016579 UCHL1 Transcription regulator activity GO:0140110 TCERG1, TP53 Neuron apoptotic process GO:0051402 DFFB, MAP3K5, MAP2K6 Lipoprotein metabolic process GO:0042157 APOE Axonal transport GO:0098930 HAP1 Folic acid metabolic process GO:0046655 MTHFR Energy reserve metabolic process GO:0006112 PPARGC1A Table 2: Included GO terms as core genes. These set of genes associated to each GO term was downloaded from the AmiGO web application [48], selecting three fields to download: the Gene/product, which contains the UniProt identifier; the Gene/product name which is the name of the gene (a string with words separated by spaces); and the direct GO class, as GO terms are hierarchi- cal and a single gene can be found under different terms (which makes a single gene appear in several rows if it has more than one direct GO class). The downloaded files are tab-separated text files. In a local Jupyter Notebook, all these files were concate- nated, and an additional column containing the GO class from which each set was taken (broad GOclass was included. Using ExPASy and SwissProt from Biomanager ’s Python package the UniProt ID of each gene was translated into the gene symbol, saving it into a new column called Gene. Later on the process of this thesis, further literature review revealed potentially important genes in HD processes which were not included in the core genes set. These are HTT, MAP2K6, PRMT6, CCDC82, SOSTDC1, and FAN1. They were added manually to the list. We could have expanded the list further on by including the entire set of genes of the GO terms of these new genes, but at that moment we were facing important memory availability issues and decided not to. Once this table was assembled, we uploaded it onto the remote machine used for the rest of the analysis. The list of unique Genes are from now on referred to as the core gene set of our study. 20 Exploring ML approaches for phenotype prediction of HD 4.1.1 Core genes SNPs The Enroll dataset identifies SNPs through their chromosome position, but we need to relate them to their corresponding gene, which is easier when they are identified with their Reference SNP cluster ID (Rsid). This was achieved by creating a look- up table containing all SNPs from each core gene. The script performing this task is biomart retrieval/biomart snpid gene retrieval v2.R. For such task, the Ensembl Biomart database was used, and to easily access it we used its R API. We first extract the starting and ending positions of each gene, giving as input a vector containing the list of core genes (gene.list), and then we save the output in a data frame gene.symb. 1 # Connect to the Ensembl BioMart database 2 ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl", 3 "https://feb2014.archive.ensembl.org") 4 5 # Retrieve gene coordinates 6 gene.symb <- getBM( 7 attributes = c("ensembl_gene_id", "external_gene_id", 8 "chromosome_name", "start_position", "end_position", "strand"), 9 filters = "hgnc_symbol", values = gene.list , 10 mart = ensembl) It is important to access a dataset using the same genome assembly reference as our data, which is GRCh37/hg19, so the positions of our SNPs are comparable to the chromosome coordinates obtained with this script. Next, we access another Ensembl dataset, now containing information on SNPs. We read the gene.symb row by row, which makes the iteration consider one chromosome at a time, and ask for the position, reference identification and reference variant of all SNPs inside each chromosome by passing a query with the format ::. This petition is tried at least 5 times per chromosome, as the connection with the dataset is often unstable. When the output data frame is obtained, we bind it to a general data frame where we concatenate the outputs of all chromosomes, finally obtaining the look-up table. The following code snip is a reduced version of the used code, leaving out the error handling: 1 # Empty data frame to store ensembl outputs 2 snp.tab <- data.frame() 3 4 # Iterate over chromosomes 5 for (i in 1:nrow(gene.symb)) { 6 7 # Create coords vector as required for getBM 8 query <- paste(gene.symb$chr[i], gene.symb$st[i], 9 gene.symb$end[i], sep = ":") 10 11 ... (while loop to handle connection errors) 12 # Get table from ensembl 13 sub.snp.tab <- getBM( 14 attributes = c("chr_name", "chrom_start", 15 "refsnp_id", "allele"), 16 filters = c("chromosomal_region"), 21 Exploring ML approaches for phenotype prediction of HD 17 values = query , mart = mart 18 ) 19 } 20 21 # assign gene name 22 sub.snp.tab$gene <- gene.symb$gene_name[i] 23 24 # Append to general snp tab 25 snp.tab <- rbind(snp.tab , sub.snp.tab) 26 } The table generated through this script was then revised manually using bash com- mands. Initially we had a table with 4,153,337 rows. We then applied two filters: • Only those SNPs which had Rsid (reference identifiers starting with rs) were kept, reducing the rows to 4,106,699. • We just want to include SNPs, which are single nucleotide variations. We leave out all those variations that are insertions, deletions, and other types of mutations which do not follow the format Nr/Na (reference and alternative bases). The final revised and filtered look-up SNP table has 3,590,278 SNPs. For easier result visualization, a final table relating SNPs to the GO terms from which we extracted their corresponding genes was generated with bash in the script data preprocessing/mapping snps to GO v2.sh. It iterates over each SNP in the look-up table and searches their gene in the table of core genes which relates genes to their GO terms. 4.2 Data preprocessing 4.2.1 Feature matrix assembly The Enroll-HD dataset contains three files per chromosome which encode each individ- ual’s alleles for all the sequenced SNPs: .bed, .bim, .fam. To work with a simpler format, Plink2 software combines this kind of data into a single vcf file (Variant Coding File). Having all three files for each chromosome in the same directory (each group of three is named equally, only changing their suffix so we can point to the files through a single file name), we generate the vcf files and build the final SNP matrix with the bash script vcf generation/run plink.sh. The first step is to iterate over each chromosome with a for loop and generate the vcf files with the following command: 1 plink2 --bfile "$filename" --recode vcf --out "$out" This command generates two files: a .log file which will not be used and hence it is deleted, and the .vcf, which is a text file where each row is a SNP position, the first 10 columns describe the SNP and the following ones are the samples’ genotype in that chromosome position. The genotype is encoded with 0/0 (homozygous for the reference variant), 0/1 (heterozygous) and 1/1 (homozygous for the alternative variant). To avoid saving the totality of the encoded SNPs contained in the original dataset, just after creating each vcf and still inside the for loop we filter the recently generated vcf 22 Exploring ML approaches for phenotype prediction of HD file using the SNP look-up table: we take the positions of all those SNPs in the table whose chromosome matches with the for loop iteration, and we keep the rows in the vcf which match these positions using the command grep. With this filtering we are not including the file’s headers. The row containing the samples’ ID is saved in a separate file before filtering, and we externally checked from the command line that the column order is the same in all generated files for each chromosome. Right after this filtering, we also discard the columns that give additional information about the SNP, such as what are the reference and alternative bases on that position, the quality score that indicates the confidence that the variant is correctly called, filter status, the format of the genotype fields. By doing so, we delete the information that could differentiate between multiallelic SNPs, but they are still being included in the model. This is a common simplification done in the genomics field, as we are mainly interested in representing if a sample has an alternative variant or not, rather than what precise nucleotide is in that alternative variant, disregarding the possible different effects these variants could have down the transcription processes. This is why Rsid nomenclature is unique for the SNP position, and not for the alternative alleles. Once the vcf is filtered, the genotype is re-encoded with bash command sed to be easier to work as ML features, changing 0/0, 0/1 and 1/1 for 0, 1 and 2, respectively. Once this has been done for all 22 chromosomes, all filtered files are concatenated, generating a single table with all the SNPs of the core genes and their genotype in each sample. Since we are treating SNPs as features, we need them as columns, not rows. Due to the large size of the table, we transpose the numeric part of the table (transposing the chromosome and position columns separately afterwards) with the C++ script vcf generation /transpose matrix.cpp. Regarding the chromosome and position columns, we need to merge them into a single identifier. We do so by using their Rsid, once again using the look-up SNP table extracted with BioMart. Now we can concatenate back together SNP identification and their genotypes, now having each SNP as a column. Finally, we add three columns containing the sample identification, their sex and the number of CAG repetitions (extracted from the sample information file) as the first three columns, and we have a first version of the feature matrix, occupying approximately 11GB. 4.2.2 Alternative variant prevalence filtering To reduce the dimensionality of our data in the most unbiased fashion possible, in the concept engineering we presented the option at hand: filtering out those SNPs which are homozygous for the reference variant in more than 99% of the subjects, or equivalently only taking the SNPs for which at least 1% of the samples have an alternative variant. This is done with Python in the script data preprocessing/alternative prevalence colsuming.py. The filtering is done by applying a boolean mask to the feature matrix. This mask is created from an initialized vector containing as many indices as SNPs, and each value, which starts as zero, will ultimately be the number of subjects which have at least one 23 Exploring ML approaches for phenotype prediction of HD copy of the alternative variant. This way if all subjects had at least one copy of the alternative variant in a certain SNP, the corresponding index of that SNP in the vector would contain the total number of samples of the dataset. As the table to filter is so large and loading it entirely onto processing memory can make everything slower and more expensive in terms of computation resources, the script reads it line by line (one subject at a time). It then looks at all SNPs genotypes one by one, adding 1 to the SNP index of the already mentioned vector if the genotype is different from 0 (either 1 or 2). Once the iterator has gone through the entire matrix, the boolean mask is created by comparing each vector value with the minimum value chosen as threshold, which is the 1% of the total number of samples. This mask is then applied to the feature matrix by opening an output file in write mode and opening again our feature matrix in read mode, and for each line read, it applies the mask and writes it onto the new output matrix. The final matrix contains 339886 features, which means we are including 339884 SNPs in the models. 4.2.3 Toy example matrix The filtered version of the table achieved with the previous step takes 5GB of storage, having reduced 6GB of memory. Still, to make further developments faster to build and test, a toy example of 50MB is generated with 900 samples and 70k SNPs, forcing to include the SNPs of the genes which are some of the known HD modifiers candi- dates (MLH1, MLH3, GRIK2, GRIN2A, GRIN2B, UCHL1, APOE, ASK1, MAP3K5, PPARGC1A), and the SNPs of HTT, and picking randomly the rest to sum up to 70k. This is done in subsampling/random sample snps pick.ipynb. 4.2.4 AO normality check In the concept engineering section on ML models we have stated that AO can be considered a normally distributed variable through the approximation of the Central Limit Theorem. This approximation can be tested both graphically and statistically. We can do two different graphical tests: fitting the AO vector to a normal distribution and plotting the fitted curve on the histogram of the experimental data to see how accurate is the fit (Figure 6a); and a Q-Q plot (Figure 6b). Both tests show a normal distribution. Statistically, several tests were run: Shapiro-Wilk, Kolmogorov-Smirnov and normal- test, all from Scipy ’s module stats. These tests take as the null hypothesis that the data comes from a normal distribution, so a p-value lower than 0.05 would indicate that there is enough evidence to reject the null hypothesis, and thus would be con- trary to the normal approximation. When run over the raw AO all three tests reject the null hypothesis, and again when transforming the entire raw AO vector to try to make them more normal (logarithmic, reciprocal, exponential, Box-Cox, Yeo-Johnson, standardization). 24 Exploring ML approaches for phenotype prediction of HD (a) Normal fit over actual AO histogram. (b) Q-Q plot of AO. Figure 6: Graphical normality tests. The problem lies in overpowering the tests with a too large number of samples. When the vector was split into 30 subsets and each subset was tested individually, more than 80% of the subsets did not reject the null hypothesis. We can consequently treat AO as a normally distributed variable and continue implementing a wider set of model options. 4.3 Regression models All Python code to train and test the ML models was first tested using the toy example in the same Jupyter Notebook, and once the code was optimized, a separate Python script for each model was created to execute them in parallel. All these scripts can be found in the directory code/regression models/individual models. Each individual script follows the same structure: 1. Import dependencies, including the custom functions to load data and evaluate the model performance contained in separate scripts inside the same directory. 2. Declare the working directory and the different directories which will be accessed throughout the script, as well as the paths to the data files. 3. Load the feature matrix as a sparse array with the custom function read sparse and the outcome vector as a Numpy array. 4. Scale the CAG column of the feature matrix with the custom function scale CAG, and the outcome vector with the pickled scaler aooStandardScaler generated in the script regression models/aoo scaler.py. 5. Split the data in training and testing sets with the function train test split of the Sklearn’s module model selection. We ensure that the samples set aside for testing across different scripts are the same samples by setting the same ran- dom state number. The testing set size is set to the 30% of the total number of samples. 25 Exploring ML approaches for phenotype prediction of HD Model Parameter Parameter ranges Lasso alpha [0.005, 0.01, 0.05] max iter [1000, 2000] Elastic Net alpha [0.005, 0.01, 0.05] l1 ratio [0.2, 0.5, 0.9] max iter [5000, 10000] Random Forest ccp alpha [0.001, 0.01, 0.04, 0.1] max depth [2, 4, 6, 7] n estimators [10,20,30,40] XGBoost (approx) reg alpha [0, 0.1, 0.5] max depth [3, 5] n estimators [10, 20, 40, 80] XGBoost (hist) reg alpha [0, 0.5, 0.8] max depth [2, 3, 5] n estimators [5, 10, 15, 20] Table 3: Grid search parameter values for trained models. All parameters not stated here were left with the default value. 6. Model definition, different in each script, followed by a dictionary of hyperparam- eters to test in a grid search. 7. Grid search with 5 fold cross validation using the Sklearn’s function GridSearchCV, from the model selection module, over the training sets, using the default R2 scor- ing as the metric to evaluate the trained model on the validating set in each fold. 8. Extract the best estimator fitted with the previous grid search, and pickle it in a sub-directory inside the results folder to examine it further on a separate script. 9. Evaluate the training of the model using the custom function model description, saving the generated plot as a png image in the results directory. 10. Evaluate the model performance with the custom function model metrics, saving the generated outputs in the results directory. The custom functions are explained in greater detail in the following subsections. The parameter values included in the grid search were chosen on the preliminary testing done with the toy example in the notebook, starting with standard values and tuning them to ranges closer to the best parameters chosen in each execution. When running the scripts which use the entire dataset, we also executed the scripts again (when possible) changing the parameter ranges if the chosen hyper parameters in the first execution were one of the limit values to ensure we were not limiting the best possible model performance in this aspect. The models that were chosen as most suitable for our objective and the parameters tested in their grid search are presented in Table 3. The OLS script differs slightly from the models above, as it is a simple linear regression 26 Exploring ML approaches for phenotype prediction of HD with no hyperparameters to tune, and the feature matrix given to the model only contains the variables sex and CAG repeat length. The cross validation is performed with the function cross validate of the samemodel selection module as the grid search function, with the most similar configuration as the rest of cross validations performed (returning the best trained estimator, doing 5 folds and scoring with R2). With such a large dataset it is very difficult to obtain an unbalanced split if done randomly. Nevertheless, we also checked in the testing notebook that the split does not affect the balance of the dataset with regards to sex, CAG and AO distributions. This was done by plotting the amount of samples from each sex at each set in a bar plot, and the distribution of CAG lengths in each set with boxplots, separating also by sex (see the results in appendix A). Each Python script was executed through a Slurm job, creating batch scripts for each model. The maximum resources given are 200GB for execution, with a time limit of 24 hours. Any model configuration which took more than a day to execute was discarded. 4.3.1 Feature matrix loading The feature matrix is loaded in the model training scripts using read sparse X, a func- tion created in the script regression models/individual models/data loading.py. Trying to load the matrix directly using Pandas or Numpy was too demanding memory- wise. Our dataset is made of mostly zeros, as the reference genotype is the predominant one. Hence the usage of a sparse matrix, which only saves the non-zero elements and its indices, reducing drastically the memory needed for script execution. As the matrix is not saved as a sparse matrix in its directory, we need to convert it. We could load once the matrix with a computationally demanding tool such as the ones provided by Pandas or Numpy, convert it to a sparse matrix and save it as such to directly access the sparse version from the model scripts, or create a function which allows reading the original matrix directly. Using this last option we avoid having multiple copies of the same information in our memory. This is done by reading the dense matrix by chunks, converting these chunks into csr (Compressed Sparse Row) matrices, and concatenating the sparse chunks: 1 # Initialize an empty list to store the chunks 2 chunks = [] 3 4 # Open the file 5 with open(X_path , ’r’) as file: 6 # Read header line 7 header = file.readline ().strip().split("\t") 8 9 # Initialize a list to store chunk data 10 chunk_data = [] 11 12 # Read the file in chunks 13 while True: 14 # Read chunk_size lines 15 for _ in range(chunk_size): 16 17 line = file.readline () 27 Exploring ML approaches for phenotype prediction of HD 18 19 if not line: 20 break # Reached end of line 21 22 data = line.strip ().split("\t") 23 24 # Add feature data to row vector as float32 25 rowdata = [np.float32(val) for val in data [1:]] 26 27 # Add row vector to chunk_data list 28 chunk_data.append(rowdata) 29 30 if not chunk_data: 31 break # No more data to read 32 33 # Convert the chunk data to a CSR matrix 34 chunk_sparse = csr_matrix(chunk_data) 35 36 # Append the chunk to the list 37 chunks.append(chunk_sparse) 38 39 # Clear chunk data for the next iteration 40 chunk_data = [] 41 42 # Concatenate the list of CSR matrices into a single CSR matrix 43 X = vstack(chunks) It is important to note that the sparse matrix does not include the sample names to be able to fix the data type of the matrix to 32 bit floats, and thus reduce even further the memory usage. Using 16 bit integers would be even better, but as we scale CAG between 0 and 1, we need a float data type. The final X is the sparse matrix which is returned when calling this function. The function itself takes two parameters: X path where the matrix to read is saved, and chunk size, with default 100, which is the amount of rows read once at a time from the dense matrix. The bigger the chunk size, the larger the working memory requirement. 4.3.2 Scaling functions After loading the data we scale the CAG feature and the outcome vector AO. CAG scaling is performed by giving the sparse feature matrix to the function scale CAG, created in the script regression models/individual models/data loading.py. This function takes the second column of the sparse matrix, corresponding to CAG values, and performs a Min Max scaling over it. To ease computations, we extract the column as a Numpy array, scale it, and reallocate the array as a sparse matrix (using the function csr matrix) from the sparse module of the Scipy package. The function returns the entire sparse matrix, now with CAG scaled between 0 and 1. The AO scaling is done through a standard scaler, generated once executing the script regression models/aoo scaler.py. The script takes the file which contains the AO vector, loads it as a Numpy array, converts it to a 1 dimension array and uses it to fit the StandardScaler from the preprocessing module of Sklearn with the method fit. 28 Exploring ML approaches for phenotype prediction of HD The generated scaler is then saved with a pickle format, as later on we need to convert the scaled values of AO back to their original range in a controlled manner. When the saved scaler object is loaded in the model training scripts, it transforms AO values with the StandardScaler method transform, without needing to fit it again. 4.3.3 Evaluating functions Two custom functions were created to evaluate the model after training: a. model description, intended for seeing whether the model is overfitted by plot- ting the predictions of the training samples versus their actual values and by computing the percentage of deviance explained by the model. The in- put parameters are the regressor object, the feature matrix, the outcome vector, and optionally the outcome scaler if we want the plot to represent the values scaled back to the original AO values. The predictions are computed inside the function with the regressor object’s method predict. Then the percentage of deviance explained by the model is calculated with Sklearn metrics ’ function explained variance score. If the scaler is passed as a parameter, both pre- dictions and actual values are transformed to the original range with the scaler method inverse transform. Finally the plot is done with matplotlib, using the function hexbin to plot a heatmap containing the density of points inside each hexagon that splits the plotting area. We use a heatmap approach rather than plotting each sample as a point in a scatter plot because a scatter is not as infor- mative when we have thousands of samples overlapping. The subtitle of the plot contains the value of the explained variance. b. model metrics, which returns the metrics of the model (calculated using the pre- dictions of the test set), the model parameters and a plot to visually see how close are the predictions to their real values. The main structure is similar to the pre- vious function described, but this time around the input data must be the feature matrix and outcome vector of the testing samples. The metrics computed here are the coefficient of determination R2, mean squared error MSE and mean absolute error MAE. Their values are written in a text file, alongside the pa- rameters of the model, obtained with the regressor object method get params. The plot generated with this function is made up of two subplots, one plotting the predictions versus their actual values as in the model description plot, and another representing the residuals of the predictions versus the predictions, to see whether the error in predictions is bigger along a specific range or it is homo- geneous over all predictions. Both plots have a dashed line which shows where should all points fall in a perfectly fitted model (the function x = y in the case of predictions over actual values, and the line at y = 0 in the residuals plot). In this case we also use hexbin plots. 4.4 Results presentation Apart from seeing how well each model explains the AO effect with the given feature matrix through the outputs of the evaluating functions, we also want to see which are the features that contribute the most to each trained model. In the least squares 29 Exploring ML approaches for phenotype prediction of HD methods, this can be done by looking at the values of the optimized coefficients. In the tree based methods, feature importance changes across packages. In XGBoost, we can see feature importance through two amounts: weight and gain. Weight is the number of times a feature appears in a tree across all trees in the model, so a higher weight means that many trees have considered that feature important in their tree building process. Gain represents the average improvement in accuracy of the decision tree when used across all trees in the model. This is a more direct way of calculating the contribution of a feature to the model’s performance. Sklearn’s Random Forest already has a feature importance property, which is the Gini importance. These feature importance amounts are extracted from the regressor objects pickled just after training them, and loaded into regression models/model results.ipynb. Then the most important features, selected using the feature importance measurement suited for each model type, are related to the SNP they represent, and a sorted table presenting the results is assembled, containing the columns SNP, importance amount (coefficient, gain or Gini importance), gene and GO term. These tables are saved as text files. To further present our findings in a simple way, we can do plots following the Manhattan plot format used in GWAS: the horizontal axis represents the genome positions, starting from chromosome 1 up, and on their corresponding relative positions over a chromo- some, each SNP is represented as a bar, with its height being the feature importance measurement of the represented model. In least square methods, the vertical axis will be the coefficient value of a SNP, which means that it can take both positive and nega- tive values (whether a SNP coefficient is positive or negative is valuable information and we must make it visible). With tree-based methods, the vertical axis will be the gain or Gini importance of each SNP, and thus will only have positive values. These plots are done in regression models/results visualization/model results plots.ipynb. From the results tables we can also see from which GO term are the genes of the SNPs contributing to each model. This information is best seen using a segmented bar chart that visualizes the distribution of GO categories across the different mod- els. Each bar represents a different model, and the segments within each bar rep- resent the proportion of various GO categories present in that model. The height of each segment corresponds to the percentage of that particular GO category out of the total GO categories for that model. We also include the background propor- tions, the GO proportions in the feature matrix. This plot is generated in the notebook regression models/results visualization/go contributions.ipynb. In the same notebook we also compute a two-sided Fisher test to see which GO terms are signifi- cantly enriched in the model’s representation compared to the background. 30 5 Results 5.1 Baseline model The current models of HD AO estimation predominantly rely on CAG repeat length, with the explained variability ranging from 40 to 70%, depending on the source. With the Enroll-HD dataset, a simple OLS model using as predictors sex and CAG achieves an R2 of 0.5572. This indicates that the model can explain approximately 56% of the variance in AO, based on the performance on a test set (observations not used during training). 5.2 SNP models The performance of the models that use genotype information will be compared to this baseline to assess any improvement in prediction accuracy and to identify GeMs. 5.2.1 Metrics comparison All models were subjected to extensive regularization, with hyperparameters tuned through a grid search (see Table 3 in section 4). The selected hyperparameters for each technique are detailed in Table 4. The final models were trained with these optimized hyperparameters. The predictions computed on the training set predictions to check for overfitting demonstrate satisfactory generalization capability (see appendix B). Model Parameter Value Lasso alpha 0.01 max iter 1000 Elastic Net alpha 0.01 l1 ratio 0.9 max iter 10000 Random Forest ccp alpha 0.001 max depth 6 n estimators 40 XGBoost (approx) reg alpha 0.5 max depth 3 n estimators 10 XGBoost (hist) reg alpha 0.1 max depth 2 n estimators 20 Table 4: Hyperparameter values of the final models found using cross-validation. The metrics from each resulting model are shown in Table 5, which includes the baseline for easier comparison. The best performance, in terms of metric values, was achieved by the XGBoost model using the histogram tree method, which improved the baseline R2 by 0.03. The approximate tree method performed almost as well while requiring less training time. Random Forest also achieved a similar R2, but it took more than 31 Exploring ML approaches for phenotype prediction of HD Model R2 MAE MSE Training time Nº features OLS 0.5572 0.5124 0.4462 11.3 ms 2 Lasso 0.5478 0.5188 0.4556 10.2 min 867 Elastic Net 0.5395 0.5247 0.464 13.4 min 1164 Random Forest 0.5866 0.4883 0.4165 180.5 min 1159 XGBoost (hist) 0.5908 0.4844 0.4123 13.7 min 33 XGBoost (approx) 0.5847 0.4904 0.4184 8.3 min 34 Table 5: Model metrics, execution time and number of features used, including the baseline. 10 times the time required to train the XGBoost models. In contrast, the least squares methods did not enhance the prediction compared to the baseline. Thus, adding more features does not necessarily lead to better results. In addition, the relationship between genetic information and AO seems to be non-linear since models capable of capturing non-linear relationships, such as the XGBoost variants, outperform regularized linear models, which did not uncover additional predictive information from the provided features. These models can take interactions between SNPs into account implicitly. This is in contrast with linear models, which need to explicitly incorporate interaction terms in turn increasing the size of the models. Comparing the plots produced to graphically study the model’s predictions reveals interesting differences between the two methods. In Figure 7 we compare the best- performing models of each type (similar plots for the other models can be found in appendix B). Predictions of the least squares methods exhibit non-symmetric residuals, underestimating AO at older ages and overestimating AO at younger ages. Conversely, tree based methods show a more homogeneous distribution of residuals, although they fail to predict the youngest AOs. 5.2.2 Feature importance To further inspect the trained models, we examined the features they use. All models identified CAG as the most important predictor. In tree-based methods, CAG is almost always the splitting feature at the root node. However, the subsequent predictors vary between models. Using Manhattan-like plots, we can easily visualize the SNPs that contribute the most to each model. Figure 8 shows two types of Manhattan plots, again showing just one model of each method. Plots for the remaining models can be found in appendix C, followed by the tables with the top SNPs of each model. The labeling of SNPs in these plots does not indicate significance, it simply highlights some of the most important genes in each model. Notably, certain genes are used across multiple models. There is no single SNP which is used by all models, though there are intersections within model groups. Between the two linear models, there is an overlap of 839 features. Given that the Lasso model, which has the fewest features among the two, includes 866 SNPs, 32 Exploring ML approaches for phenotype prediction of HD Figure 7: Comparison of least squares method prediction (Lasso), top image, and tree-based method prediction (Histogram XGBoost), bottom image. we can consider that both models are practically the same. This is further justified upon realizing that the chosen l1 ratio for Elastic Net is 0.9, shifting its regularization more towards a Lass-like regularization rather than an l2 regularization. The five most important SNPs in these models come from the genes SMARCD3, FAN1, MLH1, TCF7L2 and ESRRB. Interestingly, ESRRB has two SNPs with high coefficients but opposite effects, rs8017707 hastens onset while rs2278687 delays it. FAN1 and MLH1 are commonly reported AO GeMs [16]. TCF7L2 is not considered to be an AO GeM, but a repressed expression of TCF7L2 in HD mouse models has been proven to affect the myelin maintenance [67]. ESRRB is not a known AO GeM either, although it is used as a pluripotency marker in HD induced pluripotent stem cells from mice models in [68]. The effect of such coefficients, however, cannot be considered to be the cause of the large variability of AO introduced at the start of this project. Since the coefficients were estimated with standardized AO values, we can calculate the shift from the mean that a unit change in the feature’s value causes by multiplying the coefficient’s value by the AO’s original standard deviation, 11.58 years. A SNP with a related coefficient of 0.03 would shift the AO prediction from the mean by 0.35 years if it only had one 33 Exploring ML approaches for phenotype prediction of HD Figure 8: Manhattan plots of a least squares method (Lasso), top image, and a tree-based method (Histogram XGBoost), bottom image. 34 Exploring ML approaches for phenotype prediction of HD alternative variant, and by 0.69 years if both chromosomes had the alternative variant. This demonstrates that AO variability cannot be explained by a single SNP, as AO can shift up to 30 years from the mean AO in the most extreme cases. This is also graphically illustrated in Figure 9, which shows the AO variability over CAG values for a random subset of 500 samples. Cases with an alternative allele for a specific SNP are highlighted in black. If the effect of the alternative variant were significant, we would see these samples shifted consistently to one side or another of the line representing the mean AO for each CAG repeat length. However, this is not the case. The mean curve is calculated using the entire dataset to make it constant between plots. Figure 9: AO over CAG repeat length of 500 random samples, colored by the presence of an alternative allele in SNPs rs61997076 (left) and rs144287831 (right). Between the tree-based methods we only have two intersecting SNPs: rs10169129 (MYT1L) and rs118089305 (FAN1 ). The plots of samples colored by their genotype over the CAG-AO plane in this two cases are different from the linear examples in the density of samples with an alternative allele, being much smaller than in the cases of the SNPs chosen by the linear models. This characteristic is seen in many more SNPs chosen by the two XGBoost methods. Nevertheless, their effect on AO is not clearer than the last examples. The two XGBoost tree methods have 7 SNPs in common. The first two are the already seen rs10169129 (MYT1L) and rs61997076 (FAN1 ), followed by rs141338757, an HTT SNP whose alternative variant is very scarcely found but seems to delay AO in the few samples which have it (Figure 11). The rest of SNPs are rs57013064 (CDYL2 ), rs72841819 (BTRC ), rs118089305 (FAN1 ), and rs5743061 (PMS1 ). Amongst the reviewed literature, we found SNPs which coincide with some of the top SNPs amongst the models: • rs144287831 from the mismatch repair gene MLH1, which is suspected to act on CAG repeat instability [5]. This SNP is used by the linear models and XGBoost (hist). 35 Exploring ML approaches for phenotype prediction of HD Figure 10: AO over CAG repeat length of 500 random samples, colored by the presence of an alternative allele in SNPs rs118089305 (left) and rs10169129 (right). • rs79727797 from the transcription regulator geneTCERG1, which is associated with a quasi-tandem repeat in the gene that hastens HD onset [69]. This SNP is only used by XGBoost (hist). These findings should not be interpreted as being genomic positions directly related to the phenotype. We could be viewing a SNP in the same LD block as the effect causing polymorphism. What is more informative, and indicative that the trained models identified true signals, is the comparison at the gene level. Overall, the genes where the SNPs with the highest importance as features in all models are located are consistent with the results of the GWAS analysis conducted in a recent study by Lee et al. [8], where the authors used the same dataset. Figure 12 summarizes their findings regarding AO, labeling with solid arrows loci that are related significantly to the phenotype, and including candidate modifier genes on top, in red those which are DNA maintenance genes. Comparing these external results obtained with a different methodology than ours with our Manhattan plots, we see that there are many similarities. There are 3 genes with at least 1 SNP which is a feature in all 5 models: the already mentioned FAN1 and MYT1L, both important AO modifier candidates, and CDYL2, another gene from GO:0140110 as MYT1L, related to the transcription regulation. While CDYL2 does not appear among potential AO modifiers, it is mentioned in a profiling thesis of MSNs from HD mouse models by Fenster (2011) [70]. In summary, the majority of SNPs and genes highlighted by the feature importance study summarized in this section are related to DNA maintenance machinery, aligning with the initial hypothesis [6], and the regulation of transcription. This reinforces the notion that the models have likely captured meaningful associations between genetic factors and AO. 36 Exploring ML approaches for phenotype prediction of HD Figure 11: AO over CAG repeat length of 500 random samples, colored by the presence of an alternative allele in rs141338757. Figure 12: GWAS by Lee et al. [8], showing significant associations between AO and different chromosome loci. 5.2.3 GO enrichment analysis Figure 13 illustrates the GO term proportions of the genes associated with all the SNPs used in each model, juxtaposed with the initial proportions of features in the data matrix used for training the models (background proportions). Lasso, Elastic Net, and Random Forest models, with approximately a thousand features each, closely resemble the largest background proportions. In contrast, the XGBoost methods exhibit GO term proportions that deviate more from the background and vary between them. Across all models, over 40% of the SNPs used as regressors belong to GO:0140110 (transcription regulator activity), followed by terms related to protein degradation and ubiquitin. The enrichment analysis shows the best performing XG- Boost model has three enriched terms: GO:0006298 (mismatch repair), GO:0046655 (folic acid metabolism) and the Extra genes group. This last group is enriched by all models, primarily due to multiple FAN1 SNPs. The term of transcription regulation is only significantly enriched in Lasso, Elastic Net and Approx XGBoost. See appendix D for the complete results of the enrichment test. 37 Exploring ML approaches for phenotype prediction of HD Figure 13: GO terms proportions of the features used in each model plus the background proportions. 38 6 Execution schedule In this section, we present the execution schedule for the project, which is meticulously structured to ensure timely and efficient completion. The schedule encompasses a set of project management tools: a comprehensive Work Breakdown Structure (WBS) to delineate the tasks, a Program Evaluation and Review Technique (PERT) chart for critical path analysis, and a GANTT chart to visually track activity progress over time. 6.1 Work Breakdown Structure The WBS delineates the project’s key components, providing a detailed roadmap for successful execution. Our WBS is organized into four main parts: project preparation, data preprocessing, machine learning regression models, and project writing. Each segment is broken down into specific tasks to facilitate a clear understanding of the project workflow and ensure systematic progress toward our goals. Figure 14 shows what activities are included in each of these main parts. These individual activites are described below, including their estimated duration. Figure 14: WBS diagram of the project. 39 Exploring ML approaches for phenotype prediction of HD 1 Project preparation 1.1 Project planning Duration: 1 day Set goals and time landmarks from the beginning of the project. Define the key activities of the project, where to start and how. Talk over the work methodology with the project tutor. 1.2 Bibliography research on HD Duration: 7 days Literature review on the HD investigation stage, find out what has already been done on the AO prediction challenge. It is essential to look for studies which explore different pathways that could be related to disease mechanisms, most importantly to disease onset. Review basic biology concepts related to SNPs. Keep track of the references consulted through the refenerce management software Mendeley. 1.3 Core genes selection Duration: 2 days From the literature review on HD mechanisms, assemble a list of genes to be included in the models, either by individual genes (specially HD genetic modifiers) or by groups of genes of a relevant pathway. 1.4 ML theory and packages research Duration: 5 days Having read about the bibologic setting of the project, review the knowledge on ML algorithms that can be useful towards achieving the project’s objective. 1.5 Data acquisition Duration: 1 day Ask the project’s director for the Enroll-HD data, and understand how to handle it through bash. 2 Data preprocessing 2.1 SNP look-up table generation Duration: 10 days Assemble a table containing all SNPs registered for each defined core gene. Search for possible tools which can return SNPs based on an input list of genes, and implement the best option. 40 Exploring ML approaches for phenotype prediction of HD 2.2 Feature encoding and filtering Duration: 20 days Assemble a first version of a feature matrix, filtering the original data using the look- up table generated in the activity 2.1 to reduce the size of the dataset in a way which makes the dataset more HD focused. Make the matrix have a ML compatible format: observations in rows (each subject being represented by a row), the genotype being encoded as an integer, clear headers. 2.3 Final feature matrix assembly Duration: 10 days Filter the feature matrix obtained in activity 2.2 by minimum alternative variant prevalence. Make a subset of this feature matrix for easy model testing, the toy example. 3 ML regression models 3.1 Model training with toy example Duration: 30 days Test the ML models proposed after defining the concept engineering in a notebook with the toy example. Make prototypes of the custom functions to read large matri- ces, to scale features and AO values, and to evaluate model training and performance. 3.2 Model training with entire dataset Duration: 15 days Make separate scripts for each model to execute separately with the entire dataset. Check that the resulting models are not overfitted, adjust the parameters of the grid search to obtain the best model possible for each technique. 3.3 Results extraction and discussion Duration: 5 days From the saved regressors obtained in activity 3.2, study which are the most im- portant features contributing to the predictions, and present the results in the most appropriate format for each model. Discuss the obtained results with the project’s directors, and check for positive controls between the most relevant features found across models. 4 Project writing 4.1 Background Duration: 7 days Commit to paper the literature review done at the beginning of the project. Com- plete throughout the project with new references found or with concepts which have become relevant at some point during the process. 4.2 Concept engineering Duration: 7 days After knowing what are the common practices in the environment where the project is set, and what are the available resources, state what are the realistic options for all the steps necessary to attain our objective. 41 Exploring ML approaches for phenotype prediction of HD 4.3 Detailed engineering Duration: 4 days Describe what has been done and how regarding data handling, transformation, feature selection, model training and results. 4.4 Results Duration: 4 days Write the results and discussions produced during the activity 3.3. 4.5 Viability analysis and regulatory aspects Duration: 2 days Think about the strengths and weaknesses of the project by creating a SWOT dia- gram. Complement the knowledge about the project external limits by searching for regulatory factors that may influence the applications of this project. 4.6 Execution schedule creation Duration: 1 days Create an execution schedule made up of a PERT diagram to control which activities should not be delayed, and a GANTT diagram to keep track of the activities that need to be done throughout the project. 4.7 Conclusions Duration: 2 days Write the final conclusions of the project once all activities are completed. 6.2 PERT diagram PERT is a project management tool utilized to analyze and depict the tasks involved in project completion. Table 6 outlines all activities along with their precedence re- lationships and estimated duration. The PERT diagram takes this relationships and represents them graphically, having each activity on an arrow. Each node is a project’s landmark, having two times: t early (left number), which is the time when the activities coming out of that node can be started, and t last (right number), the time when the activities pointing to that node should be finished to avoid delays in the project. The top number is simply a node identification. The critical path is the combination of activities along which the time that its activities can be started and the time when they should be finished is the same, meaning that any delay in these activities affects directly the total duration of the project. Our critical path is mainly formed by the initial literature review to select the core genes, the feature matrix assembly and the model training. Any activity outside this path can be postponed until a limit (before reaching the t last of the node to which they are directed) without altering the project’s total duration. 42 Exploring ML approaches for phenotype prediction of HD ID Activity name Dependencies Duration A Project planning - 1 B Bibliography research on HD - 7 C Core genes selection B 2 D ML theory research B 5 E Data acquisition - 1 F SNP look-up table generation C 10 G Feature encoding and filtering E,F 20 H Final feature matrix assembly G,M 10 I Model training with toy example H 30 J Model training with entire dataset I 15 K Results extraction and discussion J 5 L Background writing B 7 M Concept engineering B,D 7 N Detailed engineering writing J 4 O Results writing K 4 P Viability analysis and regulatory aspects M 2 Q Execution schedule creation A 1 R Conclusions L,N,O,P,Q 2 Table 6: Activity table for the PERT diagram. Figure 15: PERT diagram of the project. 43 Exploring ML approaches for phenotype prediction of HD 6.3 GANTT diagram Figure 16: GANTT diagram of the project schedule, starting the 29th of January and ending the 28th of May. 44 7 Technical viability To assess the technical viability of this research project a SWOT analysis is presented. With it, we identify the strengths, weaknesses, opportunities and threats associated with the technical aspects of the project in a structured framework to evaluate the internal and external factors that could influence the project’s success from a techno- logical standpoint. The diagram which summarizes the key points of the analysis is presented in Figure 17. Figure 17: SWOT analysis of the project. By examining the strengths, we aim to highlight the technical capabilities and resources that give the project a competitive edge. We can consider that the two major strengths of the project are the dataset to which we have been given access and the professional setting of HD researchers involved. The dataset contains information on all autoso- mal chromosomes and a large number of SNPs per gene, which provides us with the opportunity to expand our research without having information limits and reveal new genetic modifiers. Additionally, the project is situated in a professional environment with experienced HD researchers whom we can consult to check the validity of our lit- erature findings, which is crucial for conducting a ML exploration with solid biological foundations that can yield significant results disease-wise. My previous knowledge of ML applications through Python also contributes to the strength of the project, making 45 Exploring ML approaches for phenotype prediction of HD the model training phase quicker. Conversely, identifying the weaknesses allows us to recognize the internal limitations and challenges that could impede progress. The most significant weakness of the project is related to one of our main strengths: the dimensionality of our data. Although it gives us the power to study the whole genome, the computational resources needed to include the entire genome are not available, necessitating a restriction on what infor- mation we keep, which could potentially leave out important modifier points in the process. Moreover, although we have institutional support for the biological aspect of the research, I personally lack initial knowledge of the disease. Understanding the biological aspect of the disease is key at the start of the project to choose the processes on which to limit our search. This step can be biased due to a short period to get acquainted with the disease, the literature on it, and the state-of-the-art procedures in similar cases. Lastly, the results we might obtain are not as conclusive as an experimen- tal study, which can empirically prove that a potential GeM has a true effect on disease onset. We can only explain the results based on how each model gives importance to each feature and provide possible biological theories on how that effect could occur. The opportunities section focuses on external factors and trends that the project can leverage to its advantage. The principal opportunity is that the algorithms tried in this project are very simple. If they can improve the baseline, it is likely that a more complex algorithm will be able to improve the AO prediction much more, especially algorithms that model non-linear interactions such as Neural Networks. We could also expand the model to include other processes and genes once the pipeline of the process toward model training is created. An AO prediction algorithm that provides more accurate predictions than the standard Langbehn’s formula could potentially become a new clinical prediction tool in genetic counseling. It would also be a very useful tool in research, allowing us to control for the effect of the already known modifiers to test the effect of more polymorphisms and interactions between them. Finally, the threats analysis considers external challenges and risks that could nega- tively impact the project’s technical viability. The most direct threat to the research is losing our originality aspect, as the technology used is very simple and easily accessible, meaning that what we are doing can be done by many others simultaneously or even before us. If the results achieved are good enough to propose a new AO prediction tool for clinical application, the regulations surrounding technology testing and transfer are very strict and require long, demanding procedures. 46 8 Economic viability Three main components are essential to this project: data, technical resources, and human resources. Data for this project was graciously provided by CHDI Foundation, Inc. as part of the Enroll-HD study. CHDI Foundation is a nonprofit biomedical research organization dedicated to collaboratively developing therapeutics for Huntington’s disease (HD). The economic costs associated with data collection are not within the scope of this project and therefore are not factored into our analysis. Due to the substantial memory requirements of the data, analysis cannot be performed using standard laptops. Instead, a computationally capable computer is necessary. Most of the data manipulation and analysis were conducted on a remote computational cluster provided by Creatio, a service that had been paid for years prior to the initiation of this project. To estimate the cost of this service, we referenced the pricing of Amazon Elastic Compute Cloud (Amazon EC2) from Amazon Web Services. Based on our execution schedule (refer to section 6), we estimate that we utilized the cluster for approximately 93 days. Assuming an average of 3 hours of code execution per day, with a computational memory requirement of 10GB (or 10.74 GiB) and a Linux operating system, we found that the most cost-effective instance is the t3.xlarge [71]. Opting for the EC2 Instance Savings Plan for the minimum duration (1 year) would result in a total cost of 1,287.72 USD, or alternatively, selecting the on-demand option would amount to approximately 656.92 USD for four months, making it the more economical choice. Notably, the personal laptop used to access the remote computational resources is not considered a project expense. Regarding human resources, the project team consists of myself and the supervising tutor. As we were not directly compensated for our involvement in the project, human resources are not considered as an economic limitation. In conclusion, the primary economic consideration for this project is the acquisition of a sufficiently powerful computing machine. If Creatio did not possess such a resource prior to the project, the estimated total cost would approximate 656.92 USD. 47 9 Regulatory and legal aspects 9.1 Enroll-HD data usage The dataset used (“GWAS12345” data) involving human subjects is shared by the CHDI foundation with qualified investigators given their institutional assurance of sub- ject confidentiality and compliance with General Data Protection Regulation (GDPR) 2016/679 requirements with respect to personal data. This regulation ensures the pro- tection and privacy of individuals’ data, particularly in sensitive research areas such as HD. GDPR mandates rigorous controls over data handling, requiring that all personal data be processed lawfully, transparently, and for a specific purpose [72]. The use of the data under GDPR involves several key considerations. Firstly, the data must be anonymized or pseudonymized to protect the identity of the subjects. We already received the data with anonymous identifications for each subject. We must ensure that data usage is restricted to the stated research objectives when we asked for the dataset to the foundation, and any deviation would require additional consent. Furthermore, data security measures are implemented to prevent unauthorized access or breaches, saving the data in password-protected machines. The principles of data minimization and purpose limitation under GDPR mean that only the data necessary for the ML models’ development and validation should be utilized. This compliance not only protects the participants’ privacy but also enhances the ethical integrity of the research. Therefore, throughout the thesis, stringent adherence to GDPR guidelines will be maintained to uphold the trust and confidentiality placed by the CHDI Foundation and the study participants. 9.2 Medical algorithms legislation The models created throughout this thesis only have research purposes. If this was to be applied as clinician support, several additional steps should be taken to ensure compliance with the relevant regulatory frameworks. Firstly, the ML predictive soft- ware would need to be evaluated to determine if it qualifies as a “Software as a Medical Device” (SaMD) under the Regulation (EU) 2017/745 – Medical Device Regulation (MDR) [73]. This regulation stipulates that any software intended to process, analyze, create, or modify medical information must adhere to strict standards to ensure safety, performance, and compliance with intended medical purposes. A thorough regulatory assessment would be necessary to classify the software appropriately and to confirm it meets all applicable requirements. By addressing these regulatory aspects, the transi- tion from research to clinical application can be achieved responsibly, ensuring that the benefits of the ML models are realized without compromising patient safety or legal compliance. 48 10 Conclusions and future outlook This thesis explores the potential of ML models to predict the AO in HD using geno- typing data, aiming to identify genetic variants significantly contributing to AO predic- tion. The study benchmarks various regression models, comparing their performance to a baseline model that relied on CAG repeat length and sex. The baseline OLS model, using CAG repeat length and sex as predictors, explained approximately 56% of the variance in AO. Among the tested models, the XGBoost model using the histogram tree method achieved the best performance, with an R2 of 0.5908, improving the baseline by 3%. Other tree-based methods like Random Forest and approximate XGBoost also performed well, while regularized linear models like Lasso and Elastic Net did not outperform the baseline although they were able to rediscover known GeMs and propose new candidates. The better performance of tree- based methods suggests that the relationship between genetic information and AO is non-linear and that these models are better suited for capturing complex interactions among genetic features. This points towards a future research direction of exploring algorithms which can better fit these non-linear interactions such as Neural Networks. Given the limited number of observations compared to the number of features, a strategy to overcome this fundamental problem was required. To that end, we have implemented regularization, cross-validation, and ensemble techniques, successfully addressing the problem in all models. CAG repeat length consistently emerged as the most significant predictor across all models. Several SNPs were identified as important across different models. Notably, SNPs in genes such as FAN1, MLH1, TCF7L2, ESRRB, and MYT1L were highlighted. The identified SNPs are consistent with known GeMs of HD and mechanisms related to DNA maintenance, transcription regulation, and protein degradation, reinforcing the validity of the models’ feature selection. We found that the effect of individual SNPs on AO was relatively small, indicating that AO variability cannot be attributed to single genetic variants alone but rather a combination of multiple genetic factors. Gene Ontology enrichment analysis revealed that the most important SNPs identified by the models were associated with genes involved in transcription regulation, protein degradation, and DNA mismatch repair. Notably, genes related to DNA maintenance machinery were consistently highlighted, aligning with existing literature on GeMs of HD. The best performing XGBoost model, in particular, significantly enriched the terms of mismatch repair and folic acid metabolism, plus the group containing other possible GeMs. One important assumption worth noting is that we encoded the SNP genotype by considering the effect of a homozygous SNP for the alternative allele to be double the effect of a heterozygous SNP. This assumption could be biologically inaccurate. Another simplification is that the CAG repeat length used here is not representative of the actual CAG repeat length found in patients, as it is an unstable mutation that undergoes somatic expansions differently in various tissues. These expansions are thought to accelerate the disease process [19], so controlling for different CAG lengths in different 49 Exploring ML approaches for phenotype prediction of HD tissues other than blood and considering the age at which the mutation was genotyped could improve the AO prediction models. Finally, the CAG length typically used in AO prediction, including in this project, is the longest mutation of the two chromosomes, which is the dominant one. However, studies suggest there may be an interaction between the two alleles [4]. Although the primary focus of this thesis was on the effect of SNPs rather than CAG length, uncontrolled variables could still influence the phenotype we are trying to explain. In conclusion, the application of machine learning models to predict AO in HD us- ing genotypic data demonstrated that incorporating additional genetic features beyond CAG repeat length can modestly improve prediction accuracy. The study confirmed the relevance of known GeMs and identified new potential candidates for further investiga- tion. These findings enhance our understanding of the genetic factors influencing HD onset and highlight the complexity of the disease’s genetic architecture. Future research could focus on refining these models and exploring the biological mechanisms under- lying the identified genetic associations to develop targeted interventions for delaying HD onset. 50 11 Bibliography [1] M. Duyao, A. Lazzarini, A. Falek, et al., Trinucleotide repeat length instability and age of onset in huntington’s disease, 1993. [Online]. Available: http://www. nature.com/naturegenetics. [2] E. M. Gatto, N. G. Rojas, G. Persi, J. L. Etcheverry, M. E. Cesarini, and C. Perandones, “Huntington disease: Advances in the understanding of its mecha- nisms,” Clinical Parkinsonism & Related Disorders, vol. 3, p. 100 056, 2020, issn: 25901125. doi: 10.1016/j.prdoa.2020.100056. [3] J.-L. Li, M. R. Hayden, E. W. Almqvist, et al., A genome scan for modifiers of age at onset in huntington disease: The hd maps study, 2003. [4] L. Djousse´, B. Knowlton, M. Hayden, et al., “Interaction of normal and expanded cag repeat sizes influences age at onset of huntington disease,” American Journal of Medical Genetics, vol. 119 A, pp. 279–282, 3 Jun. 2003, issn: 15524825. doi: 10.1002/ajmg.a.20190. [5] J. M. Lee, V. C. Wheeler, M. J. Chao, et al., “Identification of genetic factors that modify clinical onset of huntington’s disease,” Cell, vol. 162, pp. 516–526, 3 Aug. 2015, issn: 10974172. doi: 10.1016/j.cell.2015.07.003. [6] P. A. Holmans, T. H. Massey, and L. Jones, Genetic modifiers of mendelian disease: Huntington’s disease and the trinucleotide repeat disorders, Oct. 2017. doi: 10.1093/hmg/ddx261. [7] J. F. Gusella and M. E. MacDonald, “Huntington’s disease: The case for genetic modifiers,” Genome Medicine, Aug. 2009. [8] J. M. Lee, Y. Huang, M. Orth, et al., “Genetic modifiers of huntington disease differentially influence motor and cognitive domains,” American Journal of Hu- man Genetics, vol. 109, pp. 885–899, 5 May 2022, issn: 15376605. doi: 10.1016/ j.ajhg.2022.03.004. [9] R. Ghosh, S. J. Tabrizi, C. No´brega, and L. P. de Lameida, Polyglutamine Dis- orders. Springer, Jan. 2018, pp. 1–28. doi: https://doi.org/10.1007/978-3- 319-71779-1. [10] A. A. Hubers, N. Reedeker, E. J. Giltay, R. A. Roos, E. V. Duijn, and R. C. V. D. Mast, “Suicidality in huntington’s disease,” Journal of Affective Disorders, vol. 136, pp. 550–557, 3 Feb. 2012, issn: 01650327. doi: 10.1016/j.jad.2011.10.031. [11] A. Reiner, R. L. Albin, K. D. Anderson, C. J. D’amato, J. B. Penneyt, and A. B. Youngt, Differential loss of striatal projection neurons in huntington disease, 1988. [Online]. Available: https://www.pnas.org. [12] J.-P. Vonsattel, R. H. Myers, T. J. Stevens, R. J. Ferrante, E. D. Bird, and E. P. Richardson, Neuropathological classification of huntinqton’s disease, 1985. [Online]. Available: https://academic.oup.com/jnen/article/44/6/559/ 2613653. [13] P. C. Nopoulos, E. H. Aylward, C. A. Ross, et al., “Cerebral cortex structure in prodromal huntington disease,” Neurobiology of Disease, vol. 40, pp. 544–554, 3 Dec. 2010, issn: 09699961. doi: 10.1016/j.nbd.2010.07.014. 51 Exploring ML approaches for phenotype prediction of HD [14] J. MacMillan, R. Snell, A. Tyler, et al., “Molecular analysis and clinical correla- tions of the huntington’s disease mutation,” The Lancet, vol. 342, pp. 954–958, 8877 Oct. 1993, issn: 01406736. doi: 10.1016/0140-6736(93)92002-B. [15] G. P. Bates, R. Dorsey, J. F. Gusella, et al., Huntington disease, Apr. 2015. doi: 10.1038/nrdp.2015.5. [16] J. F. Gusella, J. M. Lee, and M. E. Macdonald, Huntington’s disease: Nearly four decades of human molecular genetics, Oct. 2021. doi: 10.1093/hmg/ddab170. [17] V. C. Wheeler, W. Auerbach, J. K. White, et al., Length-dependent gametic cag repeat instability in the huntington’s disease knock-in mouse, 1999. [Online]. Avail- able: https://academic.oup.com/hmg/article/8/1/115/2356046. [18] L. Kennedy, E. Evans, C. M. Chen, et al., “Dramatic tissue-specific mutation length increases are an early molecular event in huntington disease pathogenesis,” Human Molecular Genetics, vol. 12, pp. 3359–3367, 24 Dec. 2003, issn: 09646906. doi: 10.1093/hmg/ddg352. [19] R. M. Pinto, E. Dragileva, A. Kirby, et al., “Mismatch repair genes mlh1 and mlh3 modify cag instability in huntington’s disease mice: Genome-wide and candidate approaches,” PLoS Genetics, vol. 9, 10 Oct. 2013, issn: 15537390. doi: 10.1371/ journal.pgen.1003930. [20] M. Ciosi, A. Maxwell, S. A. Cumming, et al., “A genetic association study of glutamine-encoding dna sequence structures, somatic cag expansion, and dna re- pair gene variants, with huntington disease clinical outcomes,” EBioMedicine, vol. 48, pp. 568–580, Oct. 2019, issn: 23523964. doi: 10.1016/j.ebiom.2019. 09.020. [21] J.-M. Lee, E. M. Ramos, J.-H. Lee, et al., Cag repeat expansion in huntington dis- ease determines age at onset in a fully dominant fashion, 2012. [Online]. Available: www.neurology.org. [22] H. Vitet, V. Brandt, and F. Saudou, Traffic signaling: New functions of huntingtin and axonal transport in neurological disease, Aug. 2020. doi: 10.1016/j.conb. 2020.04.001. [23] A. S. Dickey and A. R. L. Spada, Therapy development in huntington disease: From current strategies to emerging opportunities, Apr. 2018. doi: 10.1002/ ajmg.a.38494. [24] J. Schulte and J. T. Littleton, “The biological function of the huntingtin protein and its relevance to huntington’s disease pathology,” Curr Trends Neurol, vol. 5, pp. 65–78, 2011. [25] E. Scherzinger, A. Sittler, K. Schweiger, et al., Self-assembly of polyglutamine- containing huntingtin fragments into amyloid-like fibrils: Implications for hunt- ington’s disease pathology (huntingtinglutamine repeataggregation), 1999. [Online]. Available: www.pnas.org.. [26] A. Vignal, D. Milan, M. SanCristobal, and A. Eggen, A review on snp and other types of molecular markers and their use in animal genetics, 2002. doi: 10.1051/ gse:2002009. 52 Exploring ML approaches for phenotype prediction of HD [27] G. Hettiarachchi and A. A. Komar, “Gwas to identify snps associated with com- mon diseases and individual risk: Genome wide association studies (gwas) to identify snps associated with common diseases and individual risk,” in Springer International Publishing, Aug. 2022, pp. 51–76, isbn: 9783031056161. doi: 10. 1007/978-3-031-05616-1_4. [28] H. Oettgen and D. H. Broide, “Introduction to mechanisms of allergic disease,” in Elsevier, Nov. 2011, pp. 1–32, isbn: 9780702045707. doi: 10.1016/B978-0- 7234-3658-4.00005-6. [29] J. Broholm, J.-J. Cassiman, D. Craufurd, et al., “Guidelines for the molecular ge- netics predictive test in huntington’s disease. international huntington association (iha) and the world federation of neurology (wfn) research group on huntington’s chorea.,” J Med Genet, vol. 31, pp. 555–559, 7 Jul. 1994. [30] D. R. Langbehn, R. R. Brinkman, D. Falush, J. S. Paulsen, and M. R. Hayden, “A new model for prediction of the age of onset and penetrance for huntington’s disease based on cag length,” Clinical Genetics, vol. 65, pp. 267–277, 4 Apr. 2004, issn: 00099163. doi: 10.1111/j.1399-0004.2004.00241.x. [31] D. R. Langbehn, M. R. Hayden, J. S. Paulsen, et al., Cag-repeat length and the age of onset in huntington disease (hd): A review and validation study of statistical approaches, Mar. 2010. doi: 10.1002/ajmg.b.30992. [32] J. S. Paulsen, M. Hayden, J. C. Stout, et al., “Preparing for preventive clinical trials the predict-hd study,” Arch Neurol, vol. 63, pp. 883–890, 3 2006. doi: https://doi.org/10.1001/archneur.63.6.883. [Online]. Available: http: //archneur.jamanetwork.com/. [33] J. S. Paulsen, D. R. Langbehn, J. C. Stout, et al., “Detection of huntington’s disease decades before diagnosis: The predict-hd study,” Journal of Neurology, Neurosurgery and Psychiatry, vol. 79, pp. 874–880, 8 2008, issn: 1468330X. doi: 10.1136/jnnp.2007.128728. [34] S. Klo¨ppel, C. Chu, G. C. Tan, et al., Automatic detection of preclinical neurode- generation presymptomatic huntington disease, 2009. [Online]. Available: http: //www.cmmt.ubc.ca/clinical/hayden. [35] J. A. Ciarochi, V. D. Calhoun, S. Lourens, et al., “Patterns of co-occurring gray matter concentration loss across the huntington disease prodrome,” Frontiers in Neurology, vol. 7, SEP Sep. 2016, issn: 16642295. doi: 10.3389/fneur.2016. 00147. [36] D. Bzdok, M. Krzywinski, and N. Altman, Points of significance: Machine learn- ing: A primer, Dec. 2017. doi: 10.1038/nmeth.4526. [37] Z. Ghahramani, Probabilistic machine learning and artificial intelligence, May 2015. doi: 10.1038/nature14541. [38] D. Bzdok, N. Altman, and M. Krzywinski, Points of significance: Statistics versus machine learning, Apr. 2018. doi: 10.1038/nmeth.4642. 53 Exploring ML approaches for phenotype prediction of HD [39] P. P. Silva, J. D. Gaudillo, J. A. Vilela, et al., “A machine learning-based snp-set analysis approach for identifying disease-associated susceptibility loci,” Scientific Reports, vol. 12, 1 Dec. 2022, issn: 20452322. doi: 10.1038/s41598-022-19708- 1. [40] P. Harper, C. Lim, C. David, and P. Disease, “Ten years of presymptomatic testing for huntington’s disease: The experience of the uk huntington’s disease prediction consortium,” J Med Genet, vol. 37, pp. 567–571, 8 Aug. 2000. doi: 10.1136/jmg.37.8.567. [41] A. Drouin, G. Letarte, F. Raymond, M. Marchand, J. Corbeil, and F. Laviolette, “Interpretable genotype-to-phenotype classifiers with performance guarantees,” Scientific Reports, vol. 9, 1 Dec. 2019, issn: 20452322. doi: 10.1038/s41598- 019-40561-2. [42] S. Szymczak, J. M. Biernacka, H. J. Cordell, et al., “Machine learning in genome- wide association studies,” vol. 33, 2009. doi: 10.1002/gepi.20473. [43] A. Ziegler, A. L. DeStefano, and I. R. Ko¨nig, “Data mining, neural nets, trees - problems 2 and 3 of genetic analysis workshop 15,” vol. 31, 2007. doi: 10.1002/ gepi.20280. [44] S. Sathe, J. Ware, J. Levey, et al., Enroll-hd: An integrated clinical research plat- form and worldwide observational study for huntington’s disease, Aug. 2021. doi: 10.3389/fneur.2021.667420. [45] J. Ouwerkerk, S. Feleus, K. F. van der Zwaan, et al., “Machine learning in hunt- ington’s disease: Exploring the enroll-hd dataset for prognosis and driving capa- bility prediction,” Orphanet Journal of Rare Diseases, vol. 18, 1 Dec. 2023, issn: 17501172. doi: 10.1186/s13023-023-02785-4. [46] J. Ko, H. Furby, X. Ma, et al., “Clustering and prediction of disease progression trajectories in huntington’s disease: An analysis of enroll-hd data using a machine learning approach,” Frontiers in Neurology, Jan. 2023. doi: 10.3389/fneur. 2022.1034269. [47] J. M. Lee, K. Correia, J. Loupe, et al., “Cag repeat not polyglutamine length determines timing of huntington’s disease onset,” Cell, vol. 178, 887–900.e14, 4 Aug. 2019, issn: 10974172. doi: 10.1016/j.cell.2019.06.036. [48] S. Carbon, A. Ireland, C. J. Mungall, S. Shu, B. Marshall, and S. Lewis, “Amigo: Online access to ontology and annotation data,” Bioinformatics, vol. 25, pp. 288– 289, 2 Jan. 2009, issn: 1367-4811. doi: 10.1093/bioinformatics/btn615. [49] D. Szklarczyk, A. L. Gable, D. Lyon, et al., “String v11: Protein-protein as- sociation networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.,” Nucleic acids research, vol. 47, pp. D607– D613, D1 Jan. 2019, issn: 1362-4962. doi: 10.1093/nar/gky1131. [50] S. Cherlin, R. A. Howey, and H. J. Cordell, “Using penalized regression to predict phenotype from snp data,” vol. 12, BioMed Central Ltd., Sep. 2018. doi: 10. 1186/s12919-018-0149-2. [51] A. Mele´ndez, C. Lo´pez, D. Bonet, et al., “Assessing tree-based phenotype predic- tion on the uk biobank,” 2023. 54 Exploring ML approaches for phenotype prediction of HD [52] P. Roback and J. Legler, Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, 1st ed. Chapman and Hall/CRC, 2021. [53] M. Rouaud, Probability, Statistics and Estimation: Propagation of Uncertainties in Experimental Measurement. 2013. [54] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Springer, 2009. [55] J. Friedman, T. Hastie, and R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, 2010. [Online]. Available: http://www. jstatsoft.org/. [56] M. Ghazwani and M. Y. Begum, “Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical processing: Gradient boosting, extra trees, and random forest models,” Scientific Reports, vol. 13, 1 Dec. 2023, issn: 20452322. doi: 10.1038/s41598-023-37232-8. [57] J. Elith, J. R. Leathwick, and T. Hastie, A working guide to boosted regression trees, Jul. 2008. doi: 10.1111/j.1365-2656.2008.01390.x. [58] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning : with Applications in R, 2nd ed. Springer, Jun. 2013. [59] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” Mar. 2016. doi: 10.1145/2939672.2939785. [Online]. Available: http://arxiv.org/abs/ 1603.02754%20http://dx.doi.org/10.1145/2939672.2939785. [60] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Pro- ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining, ser. KDD ’16, San Francisco, California, USA: ACM, 2016, pp. 785–794, isbn: 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785. [61] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [62] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [63] K. A. Sap, K. W. Geijtenbeek, S. Schipper-Krom, A. T. Guler, and E. A. Reits, Ubiquitin-modifying enzymes in huntington’s disease, Feb. 2023. doi: 10.3389/ fmolb.2023.1107323. [64] D. Moss, A. Pardin˜as, D. Langbehn, et al., “Identification of genetic variants asso- ciated with huntington’s disease progression: A genome-wide association study,” The Lancet Neurology, vol. 16, pp. 701–711, 9 2017. doi: 10 . 1016 / S1474 - 4422(17)30161-8. [Online]. Available: https://u-picardie.hal.science/ hal-04073171. [65] B. McAllister, J. Donaldson, C. S. Binda, et al., “Exome sequencing of individ- uals with huntington’s disease implicates fan1 nuclease activity in slowing cag expansion and disease onset,” Nature Neuroscience, vol. 25, pp. 446–457, 4 Apr. 2022, issn: 15461726. doi: 10.1038/s41593-022-01033-5. 55 Exploring ML approaches for phenotype prediction of HD [66] J. Cheng, H. P. Liu, W. Y. Lin, and F. J. Tsai, “Identification of contributing genes of huntington’s disease by machine learning,” BMC Medical Genomics, vol. 13, 1 Dec. 2020, issn: 17558794. doi: 10.1186/s12920-020-00822-w. [67] A. Benraiss, J. N. Mariani, A. Tate, et al., “A tcf7l2-responsive suppression of both homeostatic and compensatory remyelination in huntington disease mice,” Cell Reports, vol. 40, 9 Aug. 2022, issn: 22111247. doi: 10.1016/j.celrep. 2022.111291. [68] W. J. Szlachcic, P. M. Switonski, W. J. Krzyzosiak, M. Figlerowicz, and M. Figiel, “Huntington disease ipscs show early molecular changes in intracellular signaling, the expression of oxidative stress proteins and the p53 pathway,” DMM Disease Models and Mechanisms, vol. 8, pp. 1047–1057, 9 Sep. 2015, issn: 17548411. doi: 10.1242/dmm.019406. [69] S. V. Lobanov, B. McAllister, M. McDade-Kumar, et al., “Huntington’s disease age at motor onset is modified by the tandem hexamer repeat in tcerg1,” npj Genomic Medicine, vol. 7, 1 Dec. 2022, issn: 20567944. doi: 10.1038/s41525- 022-00317-w. [70] R. J. Fenster, “Cell-type specific translational profiling in huntington’ s disease mouse models,” 2011. [Online]. Available: http://digitalcommons.rockefeller. edu/. [71] Amazon, Aws pricing calculator. [Online]. Available: https://calculator.aws/ #/createCalculator/ec2-enhancement. [72] Council of the European Union and European Parliament, Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation). [73] Council of the European Union and European Parliament, Regulation (eu) 2017/745 of the european parliament and of the council of 5 april 2017 on medical devices, amending directive 2001/83/ec, regulation (ec) no 178/2002 an. 56 A Appendix A Figure A.1: Age distribution in training and testing sets. Figure A.2: Number of samples in training and testing sets separated by sex. 57 Exploring ML approaches for phenotype prediction of HD Figure A.3: CAG repeat length distribution in training and testing sets separated by sex. 58 B Appendix B B.1 Training predictions (a) Lasso model. (b) Elastic Net model. Figure B.1: Least squares methods predictions over training set. (a) Hist XGBoost model. (b) Approx XGBoost model. Figure B.2: XGBoost predictions over training set. 59 Exploring ML approaches for phenotype prediction of HD Figure B.3: Random Forest predictions over training set. 60 Exploring ML approaches for phenotype prediction of HD B.2 Testing predictions Figure B.4: Elastic Net predictions of testing set. Figure B.5: Random Forest predictions of testing set. 61 Exploring ML approaches for phenotype prediction of HD Figure B.6: Approx XGBoost predictions of testing set. 62 C Appendix C C.1 Manhattan Plots Figure C.1: Manhattan Plot of Elastic Net coefficients. 63 Exploring ML approaches for phenotype prediction of HD Figure C.2: Manhattan Plot of Random Forest. Figure C.3: Manhattan Plot of Approx XGBoost. 64 Exploring ML approaches for phenotype prediction of HD C.2 Feature importance tables SNP Coefficient Gene GO rs17173770 0.03656065 SMARCD3 GO:0140110 rs61997076 0.033996664 FAN1 extra genes rs144287831 0.029233377 MLH1 GO:0006298 rs10885398 -0.028514741 TCF7L2 GO:0140110 rs8017707 -0.026800904 ESRRB GO:0140110 rs1543108 -0.025090193 CTSO GO:0004197 rs2278687 0.023999402 ESRRB GO:0140110 rs1650649 -0.022265296 MSH3 GO:0006298 rs5743063 -0.021884425 PMS1 GO:0006298 rs7500197 0.021786358 GRID2 GO:0035249 rs2025533 -0.020597318 KLF12 GO:0140110 rs11293 0.01907387 ABCA1 GO:0042157 rs12531177 0.018855557 PMS2 GO:0006298 rs35782477 0.018614033 ANAPC13 GO:0043161 rs2262948 0.017807763 ZNF675 GO:0140110 rs2500286 -0.017432597 PRDM16 GO:0140110 rs9653966 -0.017269952 XPC GO:0006298 rs2875957 -0.017144997 CDYL GO:0140110 rs9312893 0.016999085 FBXL7 GO:0043161 rs17302697 0.016957078 NR3C1 GO:0035249 rs6856354 0.016785666 CTSO GO:0004197 rs3740405 0.015683452 FHIT GO:0031625 rs12422904 0.01560022 FOXN4 GO:0140110 rs201804677 -0.01556026 ZNF431 GO:0140110 rs2972071 -0.015510717 RTN4 GO:0031625 rs7316068 -0.015349336 SLC17A8 GO:0035249 rs3132292 -0.015137427 RXRA GO:0140110 rs6987733 0.015124317 UBR5 GO:0043130 rs3512 0.014648994 AGBL4 GO:0098930 rs274883 0.014632466 LIG1 GO:0006298 rs13208655 -0.014536453 JARID2 GO:0043130 Table 7: Lasso top 30 SNPs by feature importance. 65 Exploring ML approaches for phenotype prediction of HD SNP Coefficient Gene GO rs17173770 0.03365854 SMARCD3 GO:0140110 rs61997076 0.03142683 FAN1 extra genes rs10885398 -0.029996023 TCF7L2 GO:0140110 rs8017707 -0.02911162 ESRRB GO:0140110 rs144287831 0.026012165 MLH1 GO:0006298 rs2278687 0.025964953 ESRRB GO:0140110 rs1543108 -0.024854636 CTSO GO:0004197 rs1650649 -0.023066632 MSH3 GO:0006298 rs7500197 0.022596601 GRID2 GO:0035249 rs5743063 -0.022342289 PMS1 GO:0006298 rs1917533 -0.021176575 ACTN1 GO:0140110 rs35782477 0.020563638 ANAPC13 GO:0043161 rs2025533 -0.020403964 KLF12 GO:0140110 rs201804677 -0.019823277 ZNF431 GO:0140110 rs9312893 0.019517148 FBXL7 GO:0043161 rs2262948 0.019241005 ZNF675 GO:0140110 rs12531177 0.019214965 PMS2 GO:0006298 rs2500286 -0.019203568 PRDM16 GO:0140110 rs11293 0.018643003 ABCA1 GO:0042157 rs2875957 -0.018634377 CDYL GO:0140110 rs3132292 -0.01829819 RXRA GO:0140110 rs3512 0.017774522 AGBL4 GO:0098930 rs12422904 0.016952137 FOXN4 GO:0140110 rs199779315 -0.016934201 ZBTB7C GO:0140110 rs9653966 -0.016681628 XPC GO:0006298 rs17227161 0.016438538 RUNX1 GO:0140110 rs2118638 -0.016436614 TCERG1L GO:0140110 rs3740405 0.01640511 FHIT GO:0031625 rs2777803 0.016330175 ABCA1 GO:0042157 rs7316068 -0.01607368 SLC17A8 GO:0035249 rs6856354 0.016003186 CTSO GO:0004197 Table 8: Elastic Net top 30 SNPs by feature importance. 66 Exploring ML approaches for phenotype prediction of HD SNP Gini Importance Gene GO rs77752857 0.001420583 NEK6 GO:0031625 rs1043742 0.000995081 CUL2 GO:0043161 rs11201880 0.000837296 GRID1 GO:0035249 rs10169129 0.000586326 MYT1L GO:0140110 rs342 0.000577957 ABCB1 GO:0031625 rs10516927 0.000570458 GRID2 GO:0035249 rs143394620 0.000508776 EGFR GO:0031625 rs118089305 0.000503646 FAN1 extra genes rs17767868 0.000498691 ZFAT GO:0140110 rs17760586 0.000495167 ZFAT GO:0140110 rs6835277 0.000484814 MAML3 GO:0140110 rs11293 0.000446546 ABCA1 GO:0042157 rs56141370 0.000431173 NEK6 GO:0031625 rs185978838 0.000424241 PGAP1 GO:0042157 rs7986341 0.000419989 KLF12 GO:0140110 rs62206880 0.000419746 ASXL1 GO:0140110 rs146175915 0.000415588 CDYL2 GO:0140110 rs138114240 0.000404046 HHAT GO:0042157 rs3512 0.000400071 AGBL4 GO:0098930 rs4297916 0.00039359 MYT1L GO:0140110 rs149682363 0.000386536 MAF GO:0140110 rs191435283 0.000355128 EXT1 GO:0035249 rs111434006 0.000333013 MSH3 GO:0006298 rs6582088 0.000332103 TRHDE GO:0008242 rs181825 0.000320682 ADARB1 GO:0051402 rs903372 0.000305975 LRRFIP1 GO:0140110 rs147241096 0.000285983 CUL2 GO:0043161 rs17703967 0.000284305 ASB4 GO:0031625 rs141939884 0.000273324 LPIN1 GO:0140110 rs111506732 0.000272638 TRHDE GO:0008242 rs60026188 0.000270552 MYT1L GO:0140110 Table 9: Random Forest top 30 SNPs by feature importance. 67 Exploring ML approaches for phenotype prediction of HD SNP Gain Gene GO rs3512 19.99822235107422 AGBL4 GO:0098930 rs115282897 17.450469970703125 PGAP1 GO:0042157 rs10169129 16.90975570678711 MYT1L GO:0140110 rs2160644 15.064666748046875 PID1 GO:0006112 rs245100 14.723200798034668 DHFR GO:0046655 rs62142184 13.985157012939453 CAPN13 GO:0004197 rs7487625 13.919387817382812 HMGA2 GO:0140110 rs61997076 13.325885772705078 FAN1 extra genes rs57013064 13.008210182189941 CDYL2 GO:0140110 rs118089305 12.423885345458984 FAN1 extra genes rs11678557 11.876806259155273 PID1 GO:0006112 rs116530043 10.932844161987305 GRIN3B GO:0035249 rs113363660 10.814685821533203 ACTN1 GO:0140110 rs112574961 10.742321968078613 CASZ1 GO:0140110 rs607549 10.549150466918945 CIT GO:0051402 rs8100660 10.517244338989258 UNC13A GO:0035249 rs11487218 10.287309646606445 EGFR GO:0031625 rs72841819 9.312403678894043 BTRC GO:0043161 rs4083635 9.14719009399414 PDLIM1 GO:0140110 rs1908771 8.820863723754883 AEBP2 GO:0140110 rs151067506 8.694870948791504 TEAD1 GO:0140110 rs141338757 8.484679222106934 HTT extra genes rs17720191 8.309086799621582 FBXO38 GO:0043161 rs1860968 7.850630760192871 DTNBP1 GO:0098930 rs111434006 7.695502281188965 MSH3 GO:0006298 rs2536065 7.573413848876953 PRKAG2 GO:0006112 rs893507 7.270939826965332 TCERG1L GO:0140110 rs77047816 7.227169036865234 ZHX2 GO:0140110 rs5743061 6.460529327392578 PMS1 GO:0006298 rs146339947 5.831992149353027 ZNF670 GO:0140110 rs999449 5.206699848175049 RORA GO:0140110 rs12380722 4.913415908813477 KDM4C GO:0140110 rs149622839 4.911685943603516 DHFR GO:0046655 Table 10: Approx XGBoost SNPs by feature importance. 68 Exploring ML approaches for phenotype prediction of HD SNP Gain Gene GO rs10169129 21.090667724609375 MYT1L GO:0140110 rs61997076 17.790246963500977 FAN1 extra genes rs141338757 15.449226379394531 HTT extra genes rs2393777 14.811421394348145 HNF1A GO:0140110 rs245105 14.218707084655762 DHFR GO:0046655 rs2316153 11.806947708129883 CDYL2 GO:0140110 rs144287831 11.36458683013916 MLH1 GO:0006298 rs79727797 11.205156326293945 TCERG1 GO:0140110 rs139931071 10.507678985595703 STAM GO:0051402 rs2293317 10.322250366210938 FAN1 extra genes rs149561857 10.169393539428711 MAPT GO:0098930 rs57013064 10.138040542602539 CDYL2 GO:0140110 rs72841819 10.081390380859375 BTRC GO:0043161 rs10779614 9.906745910644531 PTPN14 GO:0140110 rs274883 9.713150024414062 LIG1 GO:0006298 rs1543108 9.711298942565918 CTSO GO:0004197 rs45506797 9.655911445617676 PAX3 GO:0140110 rs2949568 9.652124404907227 FAN1 extra genes rs1476130 9.63764762878418 ESRRG GO:0140110 rs6856354 9.573089599609375 CTSO GO:0004197 rs118089305 9.558370590209961 FAN1 extra genes rs117606009 9.500611305236816 HOXB7 GO:0140110 rs34837214 9.182149887084961 UBR5 GO:0043130 rs6875184 8.844964027404785 DHFR GO:0046655 rs6875489 8.843254089355469 RHOBTB3 GO:0043161 rs2074927 8.828694343566895 CLC GO:0004197 rs3789821 8.79568099975586 SMARCD3 GO:0140110 rs34065247 8.734062194824219 ESRRB GO:0140110 rs4053615 8.51319694519043 BACH2 GO:0140110 rs113189219 7.644144058227539 PLAGL1 GO:0140110 rs4908624 7.346027374267578 CAMTA1 GO:0140110 rs5743061 6.857511520385742 PMS1 GO:0006298 Table 11: Hist XGBoost SNPs by feature importance. 69 D Appendix D GO Lasso Elastic Net Hist XGBoost Approx XGBoost Random Forest GO:0140110 1.1687 1.1263 0.5569 0.4654 1.1143 GO:0043161 0.7598 0.8009 0.7326 0.709 0.9038 GO:0031625 1.362 1.3084 0.0 0.4995 1.2984 GO:0035249 0.7434 0.7764 0.0 1.1388 0.7799 GO:0098930 0.7487 0.7556 0.6506 1.3011 0.7216 GO:0051402 0.832 0.8526 0.6995 0.6776 0.8767 GO:0042157 0.9395 1.0129 0.0 0.9123 0.9367 GO:0004197 0.8981 1.0214 3.4052 1.0287 0.6375 GO:0006112 0.68 0.7841 0.0 5.2856 0.9762 GO:0043130 0.2872 0.4826 1.9962 0.0 1.143 GO:0006298 2.2613 1.5241 8.5739 5.3471 1.1612 GO:0016579 0.3036 0.3014 0.0 0.0 0.5311 GO:0008242 1.6281 2.1865 0.0 0.0 2.6887 GO:0046655 1.3075 1.4612 18.7844 18.1784 0.7319 extra genes 3.1109 3.3109 70.6949 38.1752 2.3217 Table 12: GO enrichment analysis results: odds ratio for each comparison against the background counts obtained with a Fisher test. GO Lasso Elastic Net Hist XGBoost Approx XGBoost Random Forest GO:0140110 0.0272 0.0493 0.1077 0.0336 0.0733 GO:0043161 0.0486 0.0558 1.0 1.0 0.3942 GO:0031625 0.017 0.0177 0.2601 0.721 0.0242 GO:0035249 0.0958 0.0895 0.4162 0.6966 0.0893 GO:0098930 0.1268 0.0828 1.0 0.6687 0.0437 GO:0051402 0.3611 0.3515 1.0 1.0 0.43 GO:0042157 0.849 0.9345 0.6269 1.0 0.8048 GO:0004197 0.6876 0.8621 0.0675 1.0 0.036 GO:0006112 0.2543 0.3829 1.0 0.0231 1.0 GO:0043130 0.0039 0.0245 0.4013 1.0 0.5547 GO:0006298 0.0005 0.0582 0.0065 0.0589 0.4977 GO:0016579 0.0229 0.0074 1.0 1.0 0.0943 GO:0008242 0.2437 0.0412 1.0 1.0 0.0035 GO:0046655 0.5561 0.3158 0.0058 0.0062 0.8042 extra genes 0.0087 0.0012 0.0 0.0001 0.035 Table 13: GO enrichment analysis results: p-values for each comparison against the background counts obtained with a Fisher test. 70 Exploring ML approaches for phenotype prediction of HD G O B ac k gr ou n d L as so E la st ic N et H is t X G B o os t A p p ro x X G B o os t R an d om F or es t G O :0 14 01 10 19 80 66 53 7 71 1 14 13 70 5 G O :0 04 31 61 28 34 9 56 79 2 2 88 G O :0 03 16 25 20 01 3 68 88 0 1 87 G O :0 03 52 49 18 22 3 35 49 0 2 49 G O :0 09 89 30 16 05 7 31 42 1 2 40 G O :0 05 14 02 14 98 4 32 44 1 1 45 G O :0 04 21 57 11 25 7 27 39 0 1 36 G O :0 00 41 97 10 02 1 23 35 3 1 22 G O :0 00 61 12 63 11 11 17 0 3 21 G O :0 04 31 30 54 05 4 9 1 0 21 G O :0 00 62 98 40 52 23 21 3 2 16 G O :0 01 65 79 38 48 3 4 0 0 7 G O :0 00 82 42 12 08 5 9 0 0 11 G O :0 04 66 55 12 02 4 6 2 2 3 ex tr a ge n es 88 8 7 10 5 3 7 T ab le 14 : C ou n ts of G O te rm s am on gs t ea ch m o d el fe at u re s, in cl u d in g th e co u n ts in th e tr ai n in g fe at u re m at ri x (b ac k gr ou n d ). T ab le u se d in th e G O en ri ch m en t an al y si s. 71