Please use this identifier to cite or link to this item:
https://hdl.handle.net/2445/223375
Title: | Augmenting phenotype prediction models leveraging a genomic Large Language Model |
Author: | Zavou, Georgia |
Director/Tutor: | Abante Llenas, Jordi |
Keywords: | Neurodegenerative diseases; Genomics; Machine learning; Master's thesis |
Issue Date: | 30-Jun-2025 |
Abstract: | Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by CAG repeat expansion in the HTT gene. While the length of this expansion explains a large portion of the variability in age of onset (AO), additional genetic modifiers, including regulatory variants, contribute to the remaining variability. In this work, we investigate the utility of genomic language models (gLMs), specifically Borzoi, for predicting tissue-specific gene expression changes from individual genomic data. We applied Borzoi to whole-genome sequencing data and integrated RNA-seq coverage predictions for relevant brain regions, including putamen and caudate. After weighting logSED scores using enhancer proximity, we aggregated these expression predictions at the gene level. We then trained multiple machine learning models to classify AO residuals: a baseline XGBoost model using coding SNPs, CAG repeat length, and sex; an expression-based model using Borzoi-derived features; and a multimodal model combining both genomic and predicted expression features. Our results show that Borzoi expression predictions capture meaningful regulatory signals, with functional enrichment analysis highlighting genes involved in transcription regulation, DNA repair, and glutamate signaling. While genotype-based models achieved the highest predictive performance, the multimodal model demonstrated complementary information from expression-based features. This study illustrates the potential of incorporating gLM-based expression predictions into phenotype modeling, offering insights into HD molecular mechanisms and genetic modifiers. The corresponding notebooks and scripts for this thesis can be found in the following GitHub repository: https://github.com/gzavou/FPDS_Thesis |
Note: | Final projects of the Master's in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Tutor: Jordi Abante Llenas |
URI: | https://hdl.handle.net/2445/223375 |
Appears in Collections: | Official Master's - Fundamentals of Data Science; Software - Student Works |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
TFM_Georgia_Zavou.pdf | Thesis report | 2.48 MB | Adobe PDF |
FPDS_Thesis-master.zip | Source code | 127.59 kB | zip |
This item is licensed under a Creative Commons License.