Official Master's Degree - Fundamentals of Data Science

Permanent URI for this collection: https://hdl.handle.net/2445/133281

Final theses of the Master in Fundamentals of Data Science at the Facultat de Matemàtiques i Informàtica of the Universitat de Barcelona

Recent submissions

Showing 1 - 20 of 102
  • Master's thesis (open access)
    Distance-based copying of machine learning classifiers
    (2026-01-10) Jiménez Lumbreras, Rubén; Pujol Vila, Oriol
    Copying machine learning black-box classifiers is a key framework that allows practitioners to upgrade their old models, enriching them with new properties, changing their architectures or adapting them to comply with current AI legislation. Thanks to the copying techniques and assumptions, these improvements can be made even in settings where retraining the original system from scratch is not possible due to resource, protocol or availability constraints. In this work, we propose the use of signed distances to the decision boundary as a replacement for the black-box hard labels used to build the copies, and introduce two different algorithms to compute these distances. In addition, we observe that distance-based copying could behave as a model-agnostic regularization technique and develop a flexible framework to reduce the generalization error of the copies. Then, we validate these proposals through a series of experiments on synthetic datasets and real problems. Results show that distance-based copying is successful across multiple relevant settings and evaluation metrics. Furthermore, the results also validate the quality of the predicted distances and their potential as uncertainty measures.
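The core idea of the abstract above, training a copy on signed distances rather than hard labels, can be illustrated with a minimal sketch. This is not the thesis code: the SVM decision function merely stands in as a proxy for a true signed distance, and the sampling box and MLP copy are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.neural_network import MLPRegressor

# Train a "black box" classifier (a stand-in for the original model).
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
black_box = SVC(kernel="rbf", gamma=2.0).fit(X, y)

# Signed-distance proxy: the SVM decision function is positive on one
# side of the boundary and negative on the other, and its magnitude
# grows with distance from the boundary (not an exact Euclidean distance).
synthetic = np.random.RandomState(1).uniform(-2, 3, size=(2000, 2))
distances = black_box.decision_function(synthetic)

# The "copy" regresses the signed distance instead of fitting hard 0/1
# labels; the sign of its output reproduces the decision boundary.
copy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                    random_state=0).fit(synthetic, distances)

agreement = np.mean((copy.predict(X) > 0) ==
                    (black_box.decision_function(X) > 0))
```

The copy never sees the original training labels, only distances queried on synthetic points, which is what makes the scheme usable when retraining is impossible.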
  • Master's thesis (open access)
    A comparative study of fairness methods for clinical predictions using the MIMIC-IV database
    (2026-01-17) Vukovic, Iris; Igual Muñoz, Laura
    Fairness methods are an increasingly important aspect of responsible implementations of machine learning models. As machine learning becomes more intertwined with clinical settings, bias mitigation must be accounted for, although maintaining performance remains a challenge. Fairness-aware interpretable modeling (FAIM) [1] is an in-processing fairness method that avoids extreme performance degradation while improving fairness and maintaining interpretability. In this study, the method is stress-tested by changing the original prediction task, hospital admission prediction after an emergency department (ED) stay, to the distinct clinical task of predicting the necessity of invasive medical ventilation (IMV) for patients in the intensive care unit (ICU), using electronic health record (EHR) data from the recently released MIMIC-IV database. A comparison with the baseline logistic regression model and other state-of-the-art fairness methods is presented and, although bias amongst intersectional demographic subgroups was not completely mitigated with FAIM, there was a clear improvement over both the baseline and other traditional fairness methods.
  • Master's thesis (open access)
    A Domain Adaptation Framework for Harmonized Representation Learning in Medical Datasets
    (2026-01-17) Vara Mira, Alejandro; Pujol Vila, Oriol; Lobato Delgado, Bárbara
    This Master’s Thesis addresses the critical challenge of clinical data fragmentation and the prohibitive costs of medical data acquisition by proposing a deep learning architecture for cross-dataset knowledge transfer. While the medical community possesses vast amounts of data, it remains largely trapped in isolated silos characterized by structural heterogeneity and measurement bias. To bridge these gaps, this research introduces a multi-branch neural framework that leverages a large-scale auxiliary dataset, MIMIC-III, to enrich the latent representations of smaller, specialized target datasets. The methodology centers on a dual-encoding strategy where a shared encoder extracts robust statistical patterns from common clinical attributes across populations, while independent private encoders preserve domain-specific niche variables. Empirical validation in the context of ICU mortality prediction demonstrates that this harmonized representation learning consistently improves Precision-Recall and AUC-ROC metrics. By employing a rigorous methodology upon sequential experiments, the study confirms that these performance gains are statistically significant and directly attributable to the enhanced feature representation, rather than artifacts of stochasticity or overfitting. Ultimately, this work provides a scalable blueprint for clinical data codification, proving that common attributes can serve as a functional bridge to maximize the utility of existing medical records in data-constrained environments.
  • Master's thesis (open access)
    Deconvolving the Transcriptomic Signatures of Somatic Expansion in Huntington’s Disease
    (2026-01-17) Fuses, Caterina; Abante, Jordi; Vitrià i Marca, Jordi
    Huntington’s disease (HD) is driven by somatic CAG repeat expansions, but measuring expansion length alongside transcriptomes at single-cell resolution is experimentally challenging. In this thesis I test DCAG, a deep generative model that predicts somatic expansion stage from single-cell RNA-seq data by disentangling expansion-associated transcriptional variability from confounding factors such as cell type and sequencing technology. In the supervised setting, DCAG outperforms baseline models in balanced accuracy while producing interpretable latent representations. It also shows good performance in the semi-supervised setting trained with cells from different sequencing technologies. The model enables propagation of expansion labels to unlabeled datasets, facilitating mutation-length-aware analyses of existing and future experiments. While developed for HD, the methodology is generalizable to other polyglutamine disorders or staged molecular processes, providing a computational tool for linking transcriptomic signatures to disease-relevant molecular features.
  • Master's thesis (open access)
    Comparative Study of Clustering Techniques for Hypnogram Analysis and User-Level Insights
    (2026-01-16) Casas Herce, Carmen; Seguí Mesquida, Santi; Brull Martínez, María
    This thesis aims to develop an unsupervised clustering framework to identify patterns in sleep data recorded by wearable devices. The work compares different algorithms, focusing on distance metrics and feature representations tailored to categorical time series. Firstly, it presents a comparative review of the literature on sleep pattern clustering from polysomnography and wearable data. It summarizes common approaches, feature engineering and validation strategies, and analyses how these methods influence the choices made in this work. Secondly, six clustering algorithms are applied to the sleep data and evaluated using standard clustering scores and the evolution of inertia as the number of clusters increases, in order to assess both stability and interpretability. In particular, the k-modes baseline produces clusters that fail to capture clear differences in sleep patterns, while agglomerative clustering with Hamming distance applied to a distance matrix generates very distinctive but unbalanced groups. To obtain more stable and interpretable groups, K-means clustering is explored using both Dynamic Time Warping (although the algorithm is not designed for categorical data) on the full sequences and a compact feature-based representation including sleep efficiency, REM and deep sleep percentages, and the number of awakenings lasting longer than 5 minutes. Finally, feature-envelope approaches that summarize the temporal evolution of these features across the night are implemented, obtaining higher-quality clustering results and a better characterization of sleep patterns. The conclusions focus primarily on lower values of k, where clustering metrics indicate better performance, suggesting that the underlying structure of the data is more continuous than discrete.
  • Master's thesis (open access)
    Analysis of Content Diversity in News Recommendation Systems
    (2026-01-17) Blanco Borrás, Rubén; Pujol Vila, Oriol; Vitrià i Marca, Jordi
    This Master’s Thesis studies informational diversity in news recommendation systems from a dual perspective: producer diversity and consumer-perceived diversity. The main objective is to analyze which diversity metrics are most suitable for evaluating journalistic content and how they can be applied to different datasets. In a first phase, a study is conducted on various diversity metrics used in recommendation systems, taking as reference those proposed in the literature associated with Microsoft and using the MIND Small dataset. This analysis allows the evaluation of classical diversity and coverage metrics applied to news recommendation, as well as an understanding of their advantages and limitations in controlled environments. In a second phase, these metrics are adapted and applied to a news dataset from the media outlet 3Cat, aiming to evaluate the diversity of political topics present in the published content. In this context, a distinction is made between producer diversity, related to the ideological and thematic variety of the generated content, and consumer diversity, modeled through an activation metric that approximates the diversity effectively perceived by the reader. The results allow for a comparison of both diversity perspectives and the analysis of potential mismatches between the informational supply and the diversity experienced by the user. The code developed for data processing and metric calculation can be found in the project repository (Blanco Borrás, 2026).
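Two of the classical diversity notions discussed in the recommender-systems literature, catalog coverage and intra-list diversity, can be written down compactly. These are generic textbook definitions, not necessarily the exact metrics used in the thesis, and the toy topic vectors are invented for illustration.

```python
import numpy as np

def coverage(recommended_lists, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    seen = set()
    for rec in recommended_lists:
        seen.update(rec)
    return len(seen) / catalog_size

def intra_list_diversity(rec, topic_vectors):
    """Mean pairwise cosine distance between the items in one list."""
    v = topic_vectors[rec]
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    pairs = sims[np.triu_indices(len(rec), k=1)]
    return float(np.mean(1.0 - pairs))

# Toy topic vectors for four articles (politics weight, sports weight).
topics = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
narrow = [0, 1]   # two near-identical politics pieces
broad = [0, 2]    # politics plus sports
cov = coverage([narrow, broad], catalog_size=4)
ild_narrow = intra_list_diversity(narrow, topics)
ild_broad = intra_list_diversity(broad, topics)
```

Coverage captures the producer side (how much of the catalog is surfaced at all), while intra-list diversity approximates the variety a single reader perceives.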
  • Master's thesis (open access)
    Augmenting phenotype prediction models leveraging a genomic Large Language Model
    (2025-06-30) Zavou, Georgia; Abante Llenas, Jordi
    Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by CAG repeat expansion in the HTT gene. While the length of this expansion explains a large portion of the variability in age of onset (AO), additional genetic modifiers, including regulatory variants, contribute to the remaining variability. In this work, we investigate the utility of genomic language models (gLMs), specifically Borzoi, for predicting tissue-specific gene expression changes from individual genomic data. We applied Borzoi to whole-genome sequencing data and integrated RNA-seq coverage predictions for relevant brain regions, including putamen and caudate. After weighting logSED scores using enhancer proximity, we aggregated these expression predictions at the gene level. We then trained multiple machine learning models to classify AO residuals: a baseline XGBoost model using coding SNPs, CAG repeat length, and sex; an expression-based model using Borzoi-derived features; and a multimodal model combining both genomic and predicted expression features. Our results show that Borzoi expression predictions capture meaningful regulatory signals, with functional enrichment analysis highlighting genes involved in transcription regulation, DNA repair, and glutamate signaling. While genotype-based models achieved the highest predictive performance, the multimodal model demonstrated that expression-based features carry complementary information. This study illustrates the potential of incorporating gLM-based expression predictions into phenotype modeling, offering insights into HD molecular mechanisms and genetic modifiers. The notebooks and scripts for this thesis can be found in the following GitHub repository: https://github.com/gzavou/FPDS_Thesis
  • Master's thesis (open access)
    Pocket-Aware Molecular Generation Through Learned Protein Representations
    (2025-06-30) Valverde Sánchez, Claudia; Igual Muñoz, Laura
    Drug discovery is constrained not only by the immense chemical space but also by the difficulty of efficiently exploring it and the high cost of traditional screening methods. This thesis introduces and evaluates a deep learning (DL) strategy for the de novo generation of small molecules designed to bind specific protein pockets, aiming to accelerate the identification of novel drug candidates. Our approach leverages pre-trained protein and pocket embeddings within a decoder-only Transformer architecture that learns to translate complex biological information into SMILES strings. Given the early stage of conditional binder generation, this work emphasizes systematic experimentation and thorough performance evaluation. We explored various protein and pocket representation strategies, including global protein (ESM2), structural-aware protein (SaProt), pocket-specific (PickPocket), and integrated Drug-Target Interaction (TensorDTI) embeddings. Our comprehensive evaluation pipeline assessed molecule validity, novelty, internal and cross-model diversity, physicochemical properties, and predicted drug-target interactions. Key findings include demonstrating that a high proportion of viral proteins in the training data does not bias generation, and that different input representations guide the model to explore distinct chemical spaces. While the models effectively generate diverse molecules with favorable drug-like properties, a notable limitation is their propensity to produce exact matches to the training set, indicating overfitting. Furthermore, despite the model’s sensitivity to pocket information, case studies of two specific kinase proteins revealed a challenge in consistently generating truly pocket-specific molecules, likely because of dataset characteristics such as promiscuous motifs.
This work provides valuable insights into the capabilities and current limitations of pocket-aware generative models, laying a foundation for future advancements in targeted molecule design.
  • Master's thesis (open access)
    Application of One Class Models for Financial Risk Classification
    (2025-06-30) Rey Davila, Ana; Pujol Vila, Oriol
    This project explores the use of One-Class Classification (OCC) methods to predict credit risk in highly imbalanced financial datasets. Unlike traditional supervised models, OCC approaches focus only on the majority class, in this case customers with good payment behaviour, and aim to detect unusual patterns that might suggest a higher risk of default. The study is divided into three experimental phases. The first phase uses a limited set of 13 variables, selected and categorised by experts based on risk. The second phase removes this expert selection and uses all available features. In the third phase, a hybrid strategy is tested by adding the anomaly scores generated by OCC models as extra input variables to supervised models. The models are evaluated using ROC AUC and PR AUC, two metrics well suited for imbalanced classification problems. The main goal is to analyse whether anomaly detection techniques can support or improve current risk assessment strategies in a real business setting. However, the results did not confirm the initial hypothesis, as One-Class models and hybrid approaches did not outperform traditional supervised methods.
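The hybrid strategy from the third phase, feeding a one-class anomaly score to a supervised model as an extra feature, can be sketched as follows. This is not the thesis code: IsolationForest, logistic regression and the synthetic "credit" data are stand-ins chosen so the sketch is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy credit data: ~5% "bad" customers (class 1).
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One-class step: fit only on the majority ("good") class, then use the
# anomaly score as an extra input feature for the supervised model.
occ = IsolationForest(random_state=0).fit(X_tr[y_tr == 0])
X_tr_h = np.column_stack([X_tr, occ.score_samples(X_tr)])
X_te_h = np.column_stack([X_te, occ.score_samples(X_te)])

clf = LogisticRegression(max_iter=1000).fit(X_tr_h, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te_h)[:, 1])
```

Note that the one-class model never sees defaulters at fit time, which mirrors how OCC is positioned in the abstract.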
  • Bachelor's thesis (open access)
    Forecasting Urban Traffic Patterns in London Using Hybrid AI Techniques
    (2025-06-30) Lambrou, Theodoros; Vitrià i Marca, Jordi
    Accurately forecasting traffic incident severity is crucial for urban mobility planning and real-time traffic management. This thesis explores a hybrid approach to classifying traffic severity levels using statistical and machine learning techniques. The dataset includes road segment-level hourly traffic observations in London, enriched with engineered features such as recent severity history, weather conditions, and baseline severity probabilities. We evaluate a range of models, from simple baselines to advanced classifiers, with a focus on Random Forest and XGBoost. After extensive experimentation, a tuned Random Forest model using balanced subsampling and moderate tree depth outperformed all other approaches in terms of macro-averaged F1-score and minority class recall. Detailed evaluation through time-based cross-validation, SHAP analysis, and visual diagnostics demonstrates the robustness of this model and highlights key predictive factors. The findings suggest that combining short-term temporal features with baseline statistical probabilities significantly improves performance, particularly for under-represented severity classes. The report also discusses limitations related to data coverage, class imbalance, and the potential of incorporating external signals such as incidents or public transport disruptions in future work. The corresponding python notebooks, scripts and data for this thesis are located in this GitHub repository: https://github.com/theol-10/datascience-thesis/.
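The evaluation setup described above, time-based cross-validation with a balanced-subsample Random Forest of moderate depth, can be sketched generically. This is not the thesis code: the features and the severity target are synthetic placeholders, and the hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit

# Toy hourly observations ordered in time (features + binary severity).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

# TimeSeriesSplit always trains on the past and validates on the future,
# avoiding look-ahead leakage; 'balanced_subsample' reweights classes
# inside each bootstrap sample to help the minority severity class.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    rf = RandomForestClassifier(n_estimators=200, max_depth=8,
                                class_weight="balanced_subsample",
                                random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], rf.predict(X[test_idx]),
                           average="macro"))
mean_macro_f1 = float(np.mean(scores))
```

Macro-averaging the F1-score weights the rare high-severity class equally with the common one, which is why the thesis reports it alongside minority-class recall.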
  • Master's thesis (open access)
    Regularization-Based Machine Unlearning
    (2025-06-30) Jutglar Puig, Arnau; Statuto, Nahuel; Jacques Junior, Julio C. S.
    This work addresses the unlearning problem in machine learning (ML): the process of making ML models forget a subset of their training data. We restrict this study to deep learning architectures. We propose a metric to assess different unlearning algorithms. We design a new unlearning algorithm, Regret, and compare its performance with Fine-tuning and our implementation of Fanchuan. We test them on four datasets and two different architectures. The experiments reveal that Regret outperforms Fine-tuning by a small margin. Moreover, our implementation of Fanchuan is the best-performing algorithm and clearly surpasses the other two.
  • Master's thesis (open access)
    Exploring Academic Relationships with UMAP: Dimensionality Reduction and Visualization of Topics and Authors in OpenAlex
    (2025-06-30) García Romo, Alba; Marinelli, Dimitri
    This thesis applies Uniform Manifold Approximation and Projection (UMAP) to analyse and visualize research works from the OpenAlex database. By using various embedding methods (including transformer-based models and hierarchical topic encodings) the study demonstrates that UMAP projections can effectively capture meaningful structures in the data, revealing relationships among research areas and institutions. Results show that capturing complex topic relationships across multiple domains is a challenging task. Nevertheless, the visualizations reveal significant thematic clusters and author groupings that align with our data analysis. Quantitative evaluation using clustering metrics, such as the silhouette score, confirms the agreement between visual patterns and semantic embeddings. We also show the impact of UMAP hyperparameters on balancing local and global data structure preservation, which influences visualization clarity and interpretability. The resulting interactive, zoomable visual maps provide researchers with a powerful tool to explore and understand the organization of scientific knowledge.
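The project-then-score workflow (reduce high-dimensional embeddings to 2-D, cluster, and check agreement with the silhouette score) can be sketched as below. This is not the thesis code, and to keep the sketch runnable without the umap-learn package, PCA stands in for the UMAP projection; in the real pipeline one would swap in umap.UMAP with its n_neighbors/min_dist hyperparameters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Toy document embeddings with 4 latent topics; PCA stands in here for
# the UMAP projection so the sketch has no extra dependency.
X, _ = make_blobs(n_samples=600, n_features=50, centers=4, random_state=0)
proj = PCA(n_components=2, random_state=0).fit_transform(X)

# Silhouette score across candidate cluster counts k: higher values mean
# the clusters found in the projection are compact and well separated.
scores = {k: silhouette_score(proj,
                              KMeans(n_clusters=k, n_init=10,
                                     random_state=0).fit_predict(proj))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)
```

Sweeping k this way is the quantitative counterpart of eyeballing cluster structure in the interactive maps.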
  • Master's thesis (open access)
    Evaluating Tool-Augmented ReAct Language Agents
    (2025-06-30) Eguzkitza Zalakain, Jokin; Igual Muñoz, Laura
    This thesis studies how to evaluate ReAct agents that use external tools. ReAct agents are AI agents that combine reasoning and tool use (functions), allowing large language models to perform tasks that require accessing external sources of information. These agents are becoming more common in real applications, but evaluating their behaviour remains a challenge. Using LangGraph and LangChain, three different AI agents are created using locally deployed LLM models served with Ollama. These agents use open-source tools like Wikipedia, Wikidata, Yahoo Finance and PDF readers. To evaluate them, the project combines rule-based checks with RAGAS metrics to measure tool use, answer quality, factual correctness and context use. The results show that prompt design is very important to guide the agent’s behaviour, and that typical question-answer metrics are not always enough to measure how well an agent works. This work offers a simple and practical way to test LLM agents. All the corresponding code can be found in the following repository: https://github.com/Jokinn9/Evaluating-Tool-Augmented-ReAct-Language-Agents
  • Master's thesis (open access)
    Classification of Honeypot Data Using the MITRE Framework
    (2025-06-30) Camps i Regàs, Hug; Puertas i Prats, Eloi
    Proactive cybersecurity measures are essential for effective risk mitigation in increasingly complex and evolving digital environments. Achieving this requires not only the collection of relevant data but also its accurate interpretation and the development of specialized analytical frameworks. This project focuses on addressing the challenge of interpreting cyber threat data by classifying honeypot data, provided by the Global Cyber Alliance (GCA), according to the MITRE ATT&CK Matrix—a widely recognized framework for understanding adversarial behavior. In an era dominated by large language models (LLMs), we investigate an alternative approach based on smaller, specialized models. Specifically, we design a custom architecture of lightweight models and train them for the task, evaluating their performance across various configurations. Our findings demonstrate that these models can, in certain scenarios, outperform larger LLMs in both accuracy and efficiency, offering a more sustainable and cost-effective solution for targeted cybersecurity applications.
  • Master's thesis (open access)
    Enhancing Few-Shot Learning with Large Language Models
    (2025-06-30) Diéguez Vilà, Joel; Radeva, Petia
    Recently, Few-Shot Learning has gained significant momentum in the machine learning community. This field focuses on enabling models to learn from extremely limited data, often just a handful of examples per class. Unlike traditional deep learning, which relies on large-scale datasets, few-shot learning requires novel, efficient strategies that challenge conventional assumptions and fundamentally shift the paradigm toward "learning to learn", for faster, more adaptable models. In this work, we explore the most common approaches to few-shot learning and introduce our own method. Building upon the SemFew framework, we propose a metric-based meta-learning approach using Prototypical Networks, enhanced with a semantic support module. This module uses class descriptions from WordNet, refined through a Large Language Model, to provide high-quality semantic embeddings that guide the model in understanding novel classes. Our proposed model is remarkably simple yet highly effective, achieving competitive performance with state-of-the-art methods, especially in 1-shot scenarios (only one example per class). We validate our method across three widely used few-shot classification benchmarks: CIFAR-FS, FC100, and MiniImageNet. The results consistently demonstrate the effectiveness of incorporating semantic guidance to face unseen classes. Furthermore, we present an in-depth study of modern LLMs, evaluating their performance across different prompting strategies, and investigating multiple sources of data for generating the best semantic representations. This analysis offers valuable insights into how semantic guidance can be optimized for few-shot learning. Overall, this work demonstrates the power of combining simple metric-based learning with rich semantic embeddings, offering a practical and competitive alternative to more complex architectures while encouraging new directions for future research in few-shot learning.
The source code is available at: https://github.com/jdieguvi15/TFM-SemFew.
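The metric-based core of Prototypical Networks, classifying a query by its nearest class prototype in embedding space, fits in a few lines. This is only the nearest-prototype rule with toy 2-D "embeddings"; the thesis's semantic support module and learned encoder are not shown.

```python
import numpy as np

def prototype_classify(support_x, support_y, query_x):
    """Nearest-prototype rule at the core of Prototypical Networks:
    each class prototype is the mean of its support embeddings, and
    queries are assigned to the closest prototype (Euclidean)."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0)
                       for c in classes])
    d = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]

# Toy 2-way 5-shot episode in a 2-D embedding space.
rng = np.random.default_rng(0)
support_x = np.vstack([rng.normal(0.0, 0.3, (5, 2)),
                       rng.normal(3.0, 0.3, (5, 2))])
support_y = np.array([0] * 5 + [1] * 5)
query_x = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
                     rng.normal(3.0, 0.3, (10, 2))])
preds = prototype_classify(support_x, support_y, query_x)
```

Because the prototype is just a mean, a semantic module like the one described above can shift or replace prototypes for classes with very few (or one) support examples.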
  • Master's thesis (open access)
    Explaining word interactions using integrated directional gradients
    (2025-06-17) Ballestero Ribó, Marc; Ortiz Martínez, Daniel; Radeva, Petia
    Explainability methods are key for understanding the decision-making processes behind complex text models. In this thesis, we theoretically and empirically explore Integrated Directional Gradients (IDG), a method that can attribute importance to both individual features and their high-order interactions for deep neural network (DNN) models. We introduce evaluation metrics to quantitatively assess the quality of the generated explanations, and propose a framework to adapt word-level evaluation methods to high-order phrase-level interactions. Applying IDG to a BERT-based hate speech detection model, we compare its performance at the word level against well-established methods such as Integrated Gradients (IG) and Shapley Additive Explanations (SHAP). Our results indicate that, while IDG’s word-level attributions are less faithful than those of IG and SHAP, they are the best-scoring ones in terms of plausibility. On the other hand, IDG’s high-order importance attributions exhibit high faithfulness metrics, indicating that IDG can consider hierarchical dependencies that traditional methods overlook. Qualitative analyses further support the interpretability of IDG explanations. Overall, this thesis highlights the potential of high-order explanation methods for improving transparency in text models.
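For readers unfamiliar with the Integrated Gradients baseline that IDG extends, the method and its completeness axiom (attributions summing to the change in model output) can be demonstrated on a toy differentiable model. This is not the thesis code: a logistic model stands in for the BERT classifier, and the weights and input are arbitrary.

```python
import numpy as np

# Logistic model f(x) = sigmoid(w.x + b) as a stand-in for a text model.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def f(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def integrated_gradients(x, baseline, steps=200):
    """IG attributions: average the gradient of f along the straight
    path from the baseline to x, scaled by (x - baseline)."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoint rule
    path = baseline + alphas[:, None] * (x - baseline)
    p = f(path)
    grads = (p * (1 - p))[:, None] * w             # sigmoid'(z) * w
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -0.5, 2.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
gap = abs(attr.sum() - (f(x) - f(baseline)))
```

IDG generalizes exactly this path integral to directions spanned by feature groups, which is what allows it to attribute importance to phrase-level interactions rather than single words.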
  • Master's thesis (open access)
    Using clinical data for breast cancer risk prediction and follow-up
    (2025-01-17) Hernández Antón, Sergio; Díaz, Oliver
    Breast cancer remains one of the leading causes of cancer-related morbidity and mortality worldwide, requiring robust methodologies for early risk prediction, recurrence forecasting, and survival analysis. This thesis defines a comprehensive pipeline for breast cancer risk prediction, emphasizing both technical precision and clinical relevance. The proposed framework integrates multiple components: data acquisition, preprocessing, feature extraction, model selection, interpretability, and explainability, in order to ensure accurate, transparent, and actionable outcomes. Overall, this thesis aims to advance the field of breast cancer prediction by delivering a robust, interpretable, and clinically relevant pipeline, aligning with the important goal of improving patient outcomes through early and precise detection. Additionally, to make this thesis more accessible, we add a feature dictionary for both datasets used in Appendix A. We also share the project as a GitHub repository, so that others can build on this research, and include a guide to its structure in Appendix B.
  • Master's thesis (open access)
    Time-Varying Topological Descriptors for Cardiac Disease Diagnosis
    (2025-01-17) Ferreras Alegre, Jon; Casacuberta, Carles; Igual Muñoz, Laura
    Cardiac diseases are among the most common illnesses in the world, and data scientists have created a wide range of tools to contribute to their detection and diagnosis. In particular, topological data analysis has been used to work with medical imaging and specifically with cardiac magnetic resonance images. This project introduces the use of time-varying topological descriptors along a cardiac cycle and applies them for disease diagnosis. The methods used aim to develop the relationship between topological data analysis and temporal data. We also intend to contribute to the simplification, interpretability and improvement of a computational approach to cardiac disease diagnosis, which usually involves costly calculations of radiomics or potential black boxes.
  • Master's thesis (open access)
    LLM Adaptation Techniques. Evaluating RAG Strategies
    (2025-01-17) Castanyer Bibiloni, Francesc Josep; Puertas i Prats, Eloi
    This thesis explores the application of Retrieval-Augmented Generation (RAG) systems to optimize question answering tasks, addressing limitations of Large Language Models (LLMs) in scalability, efficiency, and domain adaptability. A theoretical foundation is established, highlighting RAG’s role in integrating external knowledge to enhance language models. A RAG pipeline is implemented and evaluated through experiments analyzing embedding models, similarity metrics, retrieval parameters (k), and re-ranking using cross-encoders. Results demonstrate that re-ranking improves retrieval accuracy, even with noisy, large-scale datasets, and highlight trade-offs between retrieval scope and generative performance. This study underscores RAG’s potential as a scalable alternative to fine-tuning, enabling efficient adaptation to dynamic datasets. Future research could explore advanced RAG variants and hybrid methods for broader applications. The corresponding code notebook can be found in the following GitHub repository: https://github.com/XiscoCasta/LLM-adaptation-techniques.-Evaluating-RAG-models
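The retrieve-then-re-rank structure evaluated above can be sketched end to end. This is not the thesis pipeline: TF-IDF stands in for the embedding retriever, and a simple token-overlap score stands in for the cross-encoder (a real cross-encoder scores the query and passage jointly with a transformer); the documents and query are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "RAG systems retrieve documents before generating an answer",
    "Cross-encoders score a query and a passage jointly",
    "The capital of France is Paris",
    "Re-ranking reorders the first-stage retrieval results",
    "Large language models can be fine-tuned on domain data",
]
query = "how does re-ranking improve retrieval in RAG"

# Stage 1: cheap first-stage retrieval -> top-k candidates by
# TF-IDF cosine similarity (k is the retrieval parameter studied above).
vec = TfidfVectorizer().fit(docs + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
k = 3
candidates = np.argsort(sims)[::-1][:k]

# Stage 2: re-rank only the k candidates with a stronger scorer.
# Token overlap stands in for a cross-encoder so the sketch runs alone.
def rerank_score(q, d):
    q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

reranked = sorted(candidates,
                  key=lambda i: rerank_score(query, docs[i]), reverse=True)
```

Keeping the expensive scorer to the k candidates is exactly the trade-off between retrieval scope and cost that the experiments measure.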
  • Master's thesis (open access)
    Automated clinical coding of medical notes into the SNOMED CT Medical terminology structuring system
    (2024-09-01) Cantón Simó, Sergi; Sumoy Van Dyck, Lauro; Igual Muñoz, Laura
    Automated clinical coding is the computational process of annotating healthcare free-text data by detecting relevant medical concepts and linking them to a structured medical terminology system. One of the most significant of these systems is SNOMED CT, which contains a vast array of specific medical terms, each identified by a unique ID. This work focuses on the automatic clinical coding of medical notes within the SNOMED CT system. The study presents a comprehensive review of state-of-the-art methods in this field, followed by a detailed examination of two specific approaches, each tested and their results discussed. The first method employs a classical dictionary-based approach, while the second utilizes a deep learning BERT-based model. Additionally, the work introduces a novel contribution to one of these methods and demonstrates a practical application where automatic clinical coding facilitates the extraction of specific numerical values from medical discharge summaries.