Document type

Master thesis

Publication date

Publication license

cc-by-nc-nd (c) Georgia Zavou, 2025
Please use this identifier to cite or link to this item: https://hdl.handle.net/2445/223375

Augmenting phenotype prediction models leveraging a genomic Large Language Model

Abstract

Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by a CAG repeat expansion in the HTT gene. While the length of this expansion explains a large portion of the variability in age of onset (AO), additional genetic modifiers, including regulatory variants, contribute to the remaining variability. In this work, we investigate the utility of genomic language models (gLMs), specifically Borzoi, for predicting tissue-specific gene expression changes from individual genomic data. We applied Borzoi to whole-genome sequencing data and integrated RNA-seq coverage predictions for relevant brain regions, including putamen and caudate. After weighting logSED scores by enhancer proximity, we aggregated these expression predictions at the gene level. We then trained multiple machine learning models to classify AO residuals: a baseline XGBoost model using coding SNPs, CAG repeat length, and sex; an expression-based model using Borzoi-derived features; and a multimodal model combining both genomic and predicted expression features. Our results show that Borzoi expression predictions capture meaningful regulatory signals, with functional enrichment analysis highlighting genes involved in transcription regulation, DNA repair, and glutamate signaling. While genotype-based models achieved the highest predictive performance, the multimodal model indicated that expression-based features provide complementary information. This study illustrates the potential of incorporating gLM-based expression predictions into phenotype modeling, offering insights into HD molecular mechanisms and genetic modifiers. The corresponding notebooks and scripts for this thesis can be found in the following GitHub repository: https://github.com/gzavou/FPDS_Thesis

Description

Master's final projects, Master's in Fundamentals of Data Science (Màster de Fonaments de Ciència de Dades), Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Supervisor: Jordi Abante Llenas

Citation

ZAVOU, Georgia. Augmenting phenotype prediction models leveraging a genomic Large Language Model. [consulted: 9 June 2026]. Available at: https://hdl.handle.net/2445/223375
