Official Master's Degree - Fundamentals of Data Science (Màster Oficial - Fonaments de la Ciència de Dades)

Permanent URI for this collection: https://hdl.handle.net/2445/133281

Final theses of the Master's in Fundamentals of Data Science at the Faculty of Mathematics and Computer Science of the Universitat de Barcelona

Showing 1 - 20 of 96

Augmenting phenotype prediction models leveraging a genomic Large Language Model
(2025-06-30) Zavou, Georgia; Abante Llenas, Jordi
Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by CAG repeat expansion in the HTT gene. While the length of this expansion explains a large portion of the variability in age of onset (AO), additional genetic modifiers, including regulatory variants, contribute to the remaining variability. In this work, we investigate the utility of genomic language models (gLMs), specifically Borzoi, for predicting tissue-specific gene expression changes from individual genomic data. We applied Borzoi to whole-genome sequencing data and integrated RNA-seq coverage predictions for relevant brain regions, including putamen and caudate. After weighting logSED scores using enhancer proximity, we aggregated these expression predictions at the gene level. We then trained multiple machine learning models to classify AO residuals: a baseline XGBoost model using coding SNPs, CAG repeat length, and sex; an expression-based model using Borzoi-derived features; and a multimodal model combining both genomic and predicted expression features. Our results show that Borzoi expression predictions capture meaningful regulatory signals, with functional enrichment analysis highlighting genes involved in transcription regulation, DNA repair, and glutamate signaling. While genotype-based models achieved the highest predictive performance, the multimodal model demonstrated complementary information from expression-based features. This study illustrates the potential of incorporating gLM-based expression predictions into phenotype modeling, offering insights into HD molecular mechanisms and genetic modifiers. The corresponding notebooks and scripts for this thesis can be found in the following GitHub repository: https://github.com/gzavou/FPDS_Thesis
Pocket-Aware Molecular Generation Through Learned Protein Representations
(2025-06-30) Valverde Sánchez, Claudia; Igual Muñoz, Laura
Drug discovery is constrained not only by the immense chemical space but by the difficulty of efficiently exploring it and the high cost of traditional screening methods. This thesis introduces and evaluates a deep learning (DL) strategy for the de novo generation of small molecules designed to bind specific protein pockets, aiming to accelerate the identification of novel drug candidates. Our approach leverages pre-trained protein and pocket embeddings within a decoder-only Transformer architecture that learns to translate complex biological information into SMILES strings. Given the early stage of conditional binder generation, this work emphasizes systematic experimentation and thorough performance evaluation. We explored various protein and pocket representation strategies, including global protein (ESM2), structural-aware protein (SaProt), pocket-specific (PickPocket), and integrated Drug-Target Interaction (TensorDTI) embeddings. Our comprehensive evaluation pipeline assessed molecule validity, novelty, internal and cross-model diversity, physicochemical properties, and predicted drug-target interactions. Key findings include demonstrating that a high proportion of viral proteins in the training data does not bias generation, and that different input representations guide the model to explore distinct chemical spaces. While the models effectively generate diverse molecules with favorable drug-like properties, a notable limitation is their propensity to produce exact matches to the training set, indicating overfitting. Furthermore, despite the model’s sensitivity to pocket information, case studies of two specific kinase proteins revealed a challenge in consistently generating truly pocket-specific molecules, likely because of data set characteristics such as promiscuous motifs. 
This work provides valuable insights into the capabilities and current limitations of pocket-aware generative models, laying a foundation for future advancements in targeted molecule design.
Application of One Class Models for Financial Risk Classification
(2025-06-30) Rey Davila, Ana; Pujol Vila, Oriol
This project explores the use of One Class Classification (OCC) methods to predict credit risk in highly imbalanced financial datasets. Unlike traditional supervised models, OCC approaches focus only on the majority class, in this case customers with good payment behaviour, and aim to detect unusual patterns that might suggest a higher risk of default. The study is divided into three experimental phases. The first phase uses a limited set of 13 variables, selected and categorised by experts based on risk. The second phase removes this expert selection and uses all available features. In the third phase, a hybrid strategy is tested by adding the anomaly scores generated by OCC models as extra input variables to supervised models. The models are evaluated using ROC AUC and PR AUC, two metrics well suited for imbalanced classification problems. The main goal is to analyse whether anomaly detection techniques can support or improve current risk assessment strategies in a real business setting. However, the results did not confirm the initial hypothesis, as One Class models and hybrid approaches did not outperform traditional supervised methods.
Forecasting Urban Traffic Patterns in London Using Hybrid AI Techniques
(2025-06-30) Lambrou, Theodoros; Vitrià i Marca, Jordi
Accurately forecasting traffic incident severity is crucial for urban mobility planning and real-time traffic management. This thesis explores a hybrid approach to classifying traffic severity levels using statistical and machine learning techniques. The dataset includes road segment-level hourly traffic observations in London, enriched with engineered features such as recent severity history, weather conditions, and baseline severity probabilities. We evaluate a range of models, from simple baselines to advanced classifiers, with a focus on Random Forest and XGBoost. After extensive experimentation, a tuned Random Forest model using balanced subsampling and moderate tree depth outperformed all other approaches in terms of macro-averaged F1-score and minority class recall. Detailed evaluation through time-based cross-validation, SHAP analysis, and visual diagnostics demonstrates the robustness of this model and highlights key predictive factors. The findings suggest that combining short-term temporal features with baseline statistical probabilities significantly improves performance, particularly for under-represented severity classes. The report also discusses limitations related to data coverage, class imbalance, and the potential of incorporating external signals such as incidents or public transport disruptions in future work. The corresponding Python notebooks, scripts, and data for this thesis are located in this GitHub repository: https://github.com/theol-10/datascience-thesis/.
Regularization-Based Machine Unlearning
(2025-06-30) Jutglar Puig, Arnau; Statuto, Nahuel; Jacques Junior, Julio C. S.
This work addresses the unlearning problem in machine learning (ML): the process of making ML models forget a subset of their training data. We restrict this study to deep learning architectures. We propose a metric to assess different unlearning algorithms. We design a new unlearning algorithm, Regret, and compare its performance with Fine-tuning and our implementation of Fanchuan. We test them on four datasets and two different architectures. The experiments reveal that Regret outperforms Fine-tuning by a small margin. Moreover, our implementation of Fanchuan is the best-performing algorithm and clearly surpasses the other two.
Exploring Academic Relationships with UMAP: Dimensionality Reduction and Visualization of Topics and Authors in OpenAlex
(2025-06-30) García Romo, Alba; Marinelli, Dimitri
This thesis applies Uniform Manifold Approximation and Projection (UMAP) to analyse and visualize research works from the OpenAlex database. By using various embedding methods (including transformer-based models and hierarchical topic encodings) the study demonstrates that UMAP projections can effectively capture meaningful structures in the data, revealing relationships among research areas and institutions. Results show that capturing complex topic relationships across multiple domains is a challenging task. Nevertheless, the visualizations reveal significant thematic clusters and author groupings that align with our data analysis. Quantitative evaluation using clustering metrics, such as the silhouette score, confirms the agreement between visual patterns and semantic embeddings. We also show the impact of UMAP hyperparameters on balancing local and global data structure preservation, which influences visualization clarity and interpretability. The resulting interactive, zoomable visual maps provide researchers with a powerful tool to explore and understand the organization of scientific knowledge.
Evaluating Tool-Augmented ReAct Language Agents
(2025-06-30) Eguzkitza Zalakain, Jokin; Igual Muñoz, Laura
This thesis studies how to evaluate ReAct agents that use external tools. ReAct agents are AI agents that combine reasoning and tool use (functions), allowing large language models to perform tasks that require accessing external sources of information. These agents are becoming more common in real applications, but evaluating their behaviour remains a challenge. Using LangGraph and LangChain, three different AI agents are created on top of locally deployed LLMs served with Ollama. These agents use open-source tools like Wikipedia, Wikidata, Yahoo Finance and PDF readers. To evaluate them, the project combines rule-based checks with RAGAS metrics to measure tool use, answer quality, factual correctness and context use. The results show that prompt design is very important to guide the agent’s behaviour, and that typical question-answer metrics are not always enough to measure how well an agent works. This work offers a simple and practical way to test LLM agents. The corresponding code notebooks can be found in the following repository: https://github.com/Jokinn9/Evaluating-Tool-Augmented-ReAct-Language-Agents
Classification of Honeypot Data Using the MITRE Framework
(2025-06-30) Camps i Regàs, Hug; Puertas i Prats, Eloi
Proactive cybersecurity measures are essential for effective risk mitigation in increasingly complex and evolving digital environments. Achieving this requires not only the collection of relevant data but also its accurate interpretation and the development of specialized analytical frameworks. This project focuses on addressing the challenge of interpreting cyber threat data by classifying honeypot data, provided by the Global Cyber Alliance (GCA), according to the MITRE ATT&CK Matrix—a widely recognized framework for understanding adversarial behavior. In an era dominated by large language models (LLMs), we investigate an alternative approach based on smaller, specialized models. Specifically, we design a custom architecture of lightweight models and train them for the task, evaluating their performance across various configurations. Our findings demonstrate that these models can, in certain scenarios, outperform larger LLMs in both accuracy and efficiency, offering a more sustainable and cost-effective solution for targeted cybersecurity applications.
Enhancing Few-Shot Learning with Large Language Models
(2025-06-30) Diéguez Vilà, Joel; Radeva, Petia
Recently, Few-Shot Learning has gained significant momentum in the machine learning community. This field focuses on enabling models to learn from extremely limited data, often just a handful of examples per class. Unlike traditional deep learning, which relies on large-scale datasets, few-shot learning requires novel, efficient strategies that challenge conventional assumptions and fundamentally shift the paradigm toward "learning to learn", for faster, more adaptable models. In this work, we explore the most common approaches to few-shot learning and introduce our own method. Building upon the SemFew framework, we propose a metric-based meta-learning approach using Prototypical Networks, enhanced with a semantic support module. This module uses class descriptions from WordNet, refined through a Large Language Model, to provide high-quality semantic embeddings that guide the model in understanding novel classes. Our proposed model is remarkably simple yet highly effective, achieving competitive performance with state-of-the-art methods, especially in 1-shot scenarios (only one example per class). We validate our method across three widely used few-shot classification benchmarks: CIFAR-FS, FC100, and MiniImageNet. The results consistently demonstrate the effectiveness of incorporating semantic guidance to face unseen classes. Furthermore, we present an in-depth study of modern LLMs, evaluating their performance across different prompting strategies, and investigating multiple sources of data for generating the best semantic representations. This analysis offers valuable insights into how semantic guidance can be optimized for few-shot learning. Overall, this work demonstrates the power of combining simple metric-based learning with rich semantic embeddings, offering a practical and competitive alternative to more complex architectures while encouraging new directions for future research in few-shot learning.
The source code is available at: https://github.com/jdieguvi15/TFM-SemFew.
Explaining word interactions using integrated directional gradients
(2025-06-17) Ballestero Ribó, Marc; Ortiz Martínez, Daniel; Radeva, Petia
Explainability methods are key for understanding the decision-making processes behind complex text models. In this thesis, we theoretically and empirically explore Integrated Directional Gradients (IDG), a method that can attribute importance to both individual features and their high-order interactions for deep neural network (DNN) models. We introduce evaluation metrics to quantitatively assess the quality of the generated explanations, and propose a framework to adapt word-level evaluation methods to high-order phrase-level interactions. Applying IDG to a BERT-based hate speech detection model, we compare its performance at the word level against well-established methods such as Integrated Gradients (IG) and Shapley Additive Explanations (SHAP). Our results indicate that, while IDG’s word-level attributions are less faithful than those of IG and SHAP, they are the best-scoring ones in terms of plausibility. On the other hand, IDG’s high-order importance attributions exhibit high faithfulness metrics, indicating that IDG can consider hierarchical dependencies that traditional methods overlook. Qualitative analyses further support the interpretability of IDG explanations. Overall, this thesis highlights the potential of high-order explanation methods for improving transparency in text models.
Using clinical data for breast cancer risk prediction and follow-up
(2025-01-17) Hernández Antón, Sergio; Díaz, Oliver
Breast cancer remains one of the leading causes of cancer-related morbidity and mortality worldwide, requiring robust methodologies for early risk prediction, recurrence forecasting, and survival analysis. This thesis defines a comprehensive pipeline for breast cancer risk prediction, emphasizing both technical precision and clinical relevance. The proposed framework integrates multiple components: data acquisition, preprocessing, feature extraction, model selection, interpretability, and explainability, in order to ensure accurate, transparent, and actionable outcomes. Overall, this thesis aims to advance the field of breast cancer prediction by delivering a robust, interpretable, and clinically relevant pipeline, aligning with the important goal of improving patient outcomes through early and precise detection. Additionally, to make this thesis more accessible, we add a feature dictionary for both datasets used in Appendix A. We also share the project as a GitHub repository, so that others can benefit from this research, and include a guide to its structure in Appendix B.
Time-Varying Topological Descriptors for Cardiac Disease Diagnosis
(2025-01-17) Ferreras Alegre, Jon; Casacuberta, Carles; Igual Muñoz, Laura
Cardiac diseases are among the most common illnesses in the world, and data scientists have created a wide range of tools to contribute to their detection and diagnosis. In particular, topological data analysis has been used to work with medical imaging and specifically with cardiac magnetic resonance images. This project introduces the use of time-varying topological descriptors along a cardiac cycle and applies them for disease diagnosis. The methods used aim to develop the relationship between topological data analysis and temporal data. We also intend to contribute to the simplification, interpretability and improvement of a computational approach to cardiac disease diagnosis, which usually involves costly calculations of radiomics or potential black boxes.
LLM Adaptation Techniques. Evaluating RAG Strategies
(2025-01-17) Castanyer Bibiloni, Francesc Josep; Puertas i Prats, Eloi
This thesis explores the application of Retrieval-Augmented Generation (RAG) systems to optimize question answering tasks, addressing limitations of Large Language Models (LLMs) in scalability, efficiency, and domain adaptability. A theoretical foundation is established, highlighting RAG’s role in integrating external knowledge to enhance language models. A RAG pipeline is implemented and evaluated through experiments analyzing embedding models, similarity metrics, retrieval parameters (k), and re-ranking using cross-encoders. Results demonstrate that re-ranking improves retrieval accuracy, even with noisy, large-scale datasets, and highlight trade-offs between retrieval scope and generative performance. This study underscores RAG’s potential as a scalable alternative to fine-tuning, enabling efficient adaptation to dynamic datasets. Future research could explore advanced RAG variants and hybrid methods for broader applications. The corresponding code notebook can be found in the following GitHub repository: https://github.com/XiscoCasta/LLM-adaptation-techniques.-Evaluating-RAG-models
Automated clinical coding of medical notes into the SNOMED CT Medical terminology structuring system
(2024-09-01) Cantón Simó, Sergi; Sumoy Van Dyck, Lauro; Igual Muñoz, Laura
Automated clinical coding is the computational process of annotating healthcare free-text data by detecting relevant medical concepts and linking them to a structured medical terminology system. One of the most significant of these systems is SNOMED CT, which contains a vast array of specific medical terms, each identified by a unique ID. This work focuses on the automatic clinical coding of medical notes within the SNOMED CT system. The study presents a comprehensive review of state-of-the-art methods in this field, followed by a detailed examination of two specific approaches, each tested and their results discussed. The first method employs a classical dictionary-based approach, while the second utilizes a deep learning BERT-based model. Additionally, the work introduces a novel contribution to one of these methods and demonstrates a practical application where automatic clinical coding facilitates the extraction of specific numerical values from medical discharge summaries.
Application of the signature method in time series and financial data streams
(2024-06-28) Victoria Galindo, Ana; Vives i Santa Eulàlia, Josep, 1963-
This thesis focuses on the signature method’s role as a robust tool in data science, specifically within the realms of time series analysis and financial data streams. Originating from rough paths theory, the signature method offers a comprehensive representation of sequential data, effectively capturing intricate patterns and dependencies crucial for advanced modeling and predictive analytics. Establishing a solid theoretical foundation, this thesis explores how the signature method transforms raw time series data into structured representations that preserve essential dynamic information. Through theoretical insights and practical illustrations, the thesis demonstrates the method’s efficacy in enhancing model classification, temporal segmentation, and understanding complex model structures.
Education with language models: analyzing uncertainty estimation techniques
(2024-06-28) Tziakouri, Dafni; Vitrià i Marca, Jordi
The widespread adoption of Large Language Models (LLMs) underscores the significance of recognizing both their capabilities and constraints. This study aims to understand how LLMs function, with a specific focus on GPT models (Sai, 2023), such as GPT-3.5 (Koubaa, 2023) and GPT-4 (OpenAI, 2023). Additionally, it will demonstrate the development of a Chatbot tailored for educational purposes, employing a diverse array of tools. Through systematic examination, this study seeks to determine whether the utilization of LLMs and GenAI can be deemed trustworthy for educational purposes. Moreover, this research will address the challenge of uncertainty estimation, particularly in black-box models, highlighting the need for reliable methods to evaluate model confidence. The investigation will incorporate various experiments designed to evaluate the stability and accuracy of these models. Through comprehensive experimentation, this study seeks to contribute to a deeper understanding of LLMs’ behavior, their potential applications in education, and the challenges associated with uncertainty estimation in black-box models. The corresponding notebooks and datasets for this thesis can be found in the following GitHub repository: https://github.com/DaphneDjiakouri/MasterThesis.
Analyzing Brand Perception In LLMs
(2024-06-30) Sánchez Salazar, Jaime Leonardo; Pujol Vila, Oriol; Seguí Mesquida, Santi
This thesis investigates brand perception in different Large Language Models (LLMs), focusing on three brands: Apple, Samsung, and Huawei. We first established an understanding of brand perception and the construction of psychometrically sound tests. Leveraging this foundation, we defined four metrics across two dimensions, sentiment and preference, to facilitate a comprehensive analysis. In the sentiment dimension, we observed that the Gemma LLM exhibited consistent bias across all brands, whereas ChatGPT3.5 and ChatGPT4 displayed similar behavior for Apple and Samsung, with notable differences for Huawei. In the preference dimension, all studied LLMs demonstrated transitivity consistency, consistently preferring Apple over Samsung and Samsung over Huawei. Our findings highlight the potential for extensive analysis using the defined metrics, limited here by time constraints. We suggest several avenues for future research, including expanding the range of brands and LLMs analyzed, improving the question bank through collaboration with psychologists, and incorporating varied question connotations and mask questions to enrich the study’s depth. This study provides a methodological framework for assessing brand perception in LLMs, with implications for broader applications beyond the specific brands and models examined.
Evaluating Large Language Models as computer programming teaching assistants
(2024-06-30) Pol Pujadas, Maria Magdalena; Ortiz Martínez, Daniel; Puertas i Prats, Eloi
The principal aim of this project is to analyze how different Large Language Models (LLMs) operate in diverse contexts and situations in the field of education. In particular, we aim to assess the suitability of LLMs for specific tasks within the domain of algorithmic subjects within computer science studies. The tasks under analysis are designed to assist both students and teachers. With regard to students, we will assess the capacity of the models to implement specified code. When it comes to teachers, we will evaluate the models’ abilities to identify the purpose of a given piece of code and the potential errors introduced by students in their code, enabling students to become more self-taught and seek assistance from teachers when necessary. To evaluate these tasks, we have considered eight models. Two closed-source models were evaluated: GPT-3.5 and GPT-4. Six open-source models were also considered: Llama2, Codellama instruct, Llama3, Platypus2, Deepseek Coder and Qwen-1.5.
Large language models and causal analysis: zero-shot counterfactuals in hate speech perception
(2024-06-30) Hernández Jiménez, Sergio; Pros Rius, Roger; Vitrià i Marca, Jordi
Detecting hate speech is crucial for maintaining the integrity of social media platforms, as it involves identifying content that denigrates individuals or groups based on their characteristics. However, the expression of hate can differ across demographics and platforms, making its detection a complex task. A significant factor in hate speech is the presence of offense, which alters the perception of hate without altering the core meaning of the text. This study aims to examine how offense affects the perception of hate speech in social media comments. To achieve this, we employ two distinct causal inference methods to measure the impact of offensive language on the detection of hate speech. The first method utilizes the traditional backdoor criterion, which allows us to model the nodes of the causal graph as features in a machine learning model that predicts hate. This method is demanding from a modeling point of view, as it requires training a specific model for each node in the causal graph. The second method leverages the capabilities of Large Language Models (LLMs) to generate textual counterfactuals in a zero-shot manner, i.e., without requiring any training or fine-tuning. These textual counterfactuals are then used to estimate causal effects. Our findings reveal that the causal effect of offense on hate is higher with the LLM-generated counterfactuals than with the methodology that follows the backdoor criterion. Additionally, we train a machine learning model to directly predict the causal effect from a comment.
Open data based electricity load forecasting
(2024-06-30) Íñiguez Gómez, David; Pujol Vila, Oriol
Electricity is one of the main engines of modern societies. The agents involved in a country's electricity system need the best possible forecasts of electricity load in order to ensure that it is correctly supplied, and also to define their action strategies in the market. In this thesis we focus on electricity load forecasting for the daily market of the so-called Mercado Ibérico de Electricidad (MIBEL), where most of the energy available is auctioned. We studied the state of the art in electricity demand forecasting approaches, especially for short-term predictions, since we are making one-day-ahead estimations. We extracted data from open sources that were later used for designing and testing different types of models. Based on the performance of the different approaches, we selected a model that efficiently combines both time series forecasting and machine learning, obtaining a precision close to the one provided by the system operator, Red Eléctrica. Finally, we analyzed the relevance of each of the variables involved by using the Shapley values and regularization techniques.

The CRAI will remain closed from 2025-12-24 to 2026-01-06. Document validation will resume on 2026-01-07.
