Explaining word interactions using integrated directional gradients

dc.contributor.advisor: Ortiz Martínez, Daniel
dc.contributor.advisor: Radeva, Petia
dc.contributor.author: Ballestero Ribó, Marc
dc.date.accessioned: 2025-09-09T09:26:58Z
dc.date.available: 2025-09-09T09:26:58Z
dc.date.issued: 2025-06-17
dc.description: Master's theses of the Master's in Fundamental Principles of Data Science (Màster de Fonaments de Ciència de Dades), Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Advisors: Daniel Ortiz Martínez and Petia Radeva
dc.description.abstract: Explainability methods are key for understanding the decision-making processes behind complex text models. In this thesis, we theoretically and empirically explore Integrated Directional Gradients (IDG), a method that can attribute importance to both individual features and their high-order interactions for deep neural network (DNN) models. We introduce evaluation metrics to quantitatively assess the quality of the generated explanations, and propose a framework to adapt word-level evaluation methods to high-order phrase-level interactions. Applying IDG to a BERT-based hate speech detection model, we compare its performance at the word level against well-established methods such as Integrated Gradients (IG) and Shapley Additive Explanations (SHAP). Our results indicate that, while IDG’s word-level attributions are less faithful than those of IG and SHAP, they are the best-scoring ones in terms of plausibility. On the other hand, IDG’s high-order importance attributions exhibit high faithfulness metrics, indicating that IDG can consider hierarchical dependencies that traditional methods overlook. Qualitative analyses further support the interpretability of IDG explanations. Overall, this thesis highlights the potential of high-order explanation methods for improving transparency in text models.
dc.format.extent: 75 p.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/2445/223054
dc.language.iso: eng
dc.rights: cc-by-nc-nd (c) Marc Ballestero Ribó, 2025
dc.rights: code: MIT (c) Marc Ballestero Ribó, 2025
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.rights.uri: http://www.gnu.org/licenses/gpl-3.0.ca.html
dc.source: Official Master's Degree - Fundamental Principles of Data Science (Màster Oficial - Fonaments de la Ciència de Dades)
dc.subject.classification: Xarxes neuronals (Informàtica)
dc.subject.classification: Tractament del llenguatge natural (Informàtica)
dc.subject.classification: Discurs de l'odi
dc.subject.classification: Treballs de fi de màster
dc.subject.other: Neural networks (Computer science)
dc.subject.other: Natural language processing (Computer science)
dc.subject.other: Hate speech
dc.subject.other: Master's thesis
dc.title: Explaining word interactions using integrated directional gradients
dc.type: info:eu-repo/semantics/masterThesis

Files

Original bundle

Name: IDG_HateXplain-main.zip
Size: 1.67 GB
Format: ZIP file
Description: Source code

Name: TFM_Ballestero_Ribó_Marc.pdf
Size: 7.89 MB
Format: Adobe Portable Document Format
Description: Thesis report