Augmenting phenotype prediction models leveraging a genomic Large Language Model

dc.contributor.advisor: Abante Llenas, Jordi
dc.contributor.author: Zavou, Georgia
dc.date.accessioned: 2025-09-25T06:43:01Z
dc.date.available: 2025-09-25T06:43:01Z
dc.date.issued: 2025-06-30
dc.description: Master's final theses, Master in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Advisor: Jordi Abante Llenas [ca]
dc.description.abstract: Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by CAG repeat expansion in the HTT gene. While the length of this expansion explains a large portion of the variability in age of onset (AO), additional genetic modifiers, including regulatory variants, contribute to the remaining variability. In this work, we investigate the utility of genomic language models (gLMs), specifically Borzoi, for predicting tissue-specific gene expression changes from individual genomic data. We applied Borzoi to whole-genome sequencing data and integrated RNA-seq coverage predictions for relevant brain regions, including the putamen and caudate. After weighting logSED scores by enhancer proximity, we aggregated these expression predictions at the gene level. We then trained several machine learning models to classify AO residuals: a baseline XGBoost model using coding SNPs, CAG repeat length, and sex; an expression-based model using Borzoi-derived features; and a multimodal model combining both genomic and predicted expression features. Our results show that Borzoi expression predictions capture meaningful regulatory signals, with functional enrichment analysis highlighting genes involved in transcription regulation, DNA repair, and glutamate signaling. While genotype-based models achieved the highest predictive performance, the multimodal model showed that expression-based features carry complementary information. This study illustrates the potential of incorporating gLM-based expression predictions into phenotype modeling, offering insights into HD molecular mechanisms and genetic modifiers. The corresponding notebooks and scripts for this thesis can be found in the following GitHub repository: https://github.com/gzavou/FPDS_Thesis [en]
dc.format.extent: 46 p.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/2445/223375
dc.language.iso: eng [ca]
dc.rights: cc-by-nc-nd (c) Georgia Zavou, 2025
dc.rights: code: GPL (c) Georgia Zavou, 2025
dc.rights.accessRights: info:eu-repo/semantics/openAccess [ca]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/ [*]
dc.rights.uri: http://www.gnu.org/licenses/gpl-3.0.ca.html [*]
dc.source: Màster Oficial - Fonaments de la Ciència de Dades
dc.subject.classification: Malalties neurodegeneratives
dc.subject.classification: Genòmica
dc.subject.classification: Aprenentatge automàtic
dc.subject.classification: Treballs de fi de màster
dc.subject.other: Neurodegenerative Diseases
dc.subject.other: Genomics
dc.subject.other: Machine learning
dc.subject.other: Master's thesis
dc.title: Augmenting phenotype prediction models leveraging a genomic Large Language Model [ca]
dc.type: info:eu-repo/semantics/masterThesis [ca]

Files

Original bundle

Name: TFM_Georgia_Zavou.pdf
Size: 2.42 MB
Format: Adobe Portable Document Format
Description: Thesis report

Name: FPDS_Thesis-master.zip
Size: 127.59 KB
Format: ZIP file
Description: Source code
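
The abstract mentions weighting logSED scores by enhancer proximity and aggregating them at the gene level. The record does not give the formulas, so the following is only a minimal sketch under stated assumptions: logSED is taken here as the log2 ratio of summed predicted coverage (alternate versus reference allele) with a pseudocount of 1, the proximity weight is a hypothetical exponential decay with an arbitrary 10 kb scale, and aggregation is a weighted sum. The function names and record layout are illustrative and are not taken from the thesis code.

```python
import math
from collections import defaultdict


def log_sed(ref_coverage, alt_coverage):
    """Assumed logSED: log2 ratio of summed predicted coverage, pseudocount 1."""
    return math.log2(sum(alt_coverage) + 1.0) - math.log2(sum(ref_coverage) + 1.0)


def proximity_weight(distance_bp, scale_bp=10_000.0):
    """Hypothetical weight that decays exponentially with distance to an enhancer."""
    return math.exp(-abs(distance_bp) / scale_bp)


def aggregate_by_gene(variant_records):
    """Sum proximity-weighted scores per gene.

    Each record is a tuple (gene, log_sed_score, distance_to_enhancer_bp).
    """
    totals = defaultdict(float)
    for gene, score, distance_bp in variant_records:
        totals[gene] += proximity_weight(distance_bp) * score
    return dict(totals)


if __name__ == "__main__":
    # Toy values for illustration only; not data from the thesis.
    records = [
        ("GENE_A", log_sed([1.0], [3.0]), 0),
        ("GENE_A", -0.5, 0),
        ("GENE_B", 2.0, 5_000),
    ]
    print(aggregate_by_gene(records))
```

The resulting per-gene values could then serve as one feature column per gene for a downstream classifier. The choice of sum versus mean or max as the aggregator is a design decision that the abstract does not specify.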