Please use this identifier to cite or link to this item:
https://hdl.handle.net/2445/223054
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ortiz Martínez, Daniel | - |
dc.contributor.advisor | Radeva, Petia | - |
dc.contributor.author | Ballestero Ribó, Marc | - |
dc.date.accessioned | 2025-09-09T09:26:58Z | - |
dc.date.available | 2025-09-09T09:26:58Z | - |
dc.date.issued | 2025-06-17 | - |
dc.identifier.uri | https://hdl.handle.net/2445/223054 | - |
dc.description | Final projects of the Master's in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Tutors: Daniel Ortiz Martínez and Petia Radeva | ca |
dc.description.abstract | Explainability methods are key for understanding the decision-making processes behind complex text models. In this thesis, we theoretically and empirically explore Integrated Directional Gradients (IDG), a method that can attribute importance to both individual features and their high-order interactions for deep neural network (DNN) models. We introduce evaluation metrics to quantitatively assess the quality of the generated explanations, and propose a framework to adapt word-level evaluation methods to high-order phrase-level interactions. Applying IDG to a BERT-based hate speech detection model, we compare its performance at the word level against well-established methods such as Integrated Gradients (IG) and Shapley Additive Explanations (SHAP). Our results indicate that, while IDG’s word-level attributions are less faithful than those of IG and SHAP, they are the best-scoring ones in terms of plausibility. On the other hand, IDG’s high-order importance attributions exhibit high faithfulness metrics, indicating that IDG can consider hierarchical dependencies that traditional methods overlook. Qualitative analyses further support the interpretability of IDG explanations. Overall, this thesis highlights the potential of high-order explanation methods for improving transparency in text models. | ca |
dc.format.extent | 75 p. | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | eng | ca |
dc.rights | cc-by-nc-nd (c) Marc Ballestero Ribó, 2025 | - |
dc.rights | code: MIT (c) Marc Ballestero Ribó, 2025 | - |
dc.rights.uri | http://www.gnu.org/licenses/gpl-3.0.ca.html | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ | * |
dc.subject.classification | Xarxes neuronals (Informàtica) | - |
dc.subject.classification | Tractament del llenguatge natural (Informàtica) | - |
dc.subject.classification | Discurs de l'odi | - |
dc.subject.classification | Treballs de fi de màster | - |
dc.subject.other | Neural networks (Computer science) | - |
dc.subject.other | Natural language processing (Computer science) | - |
dc.subject.other | Hate speech | - |
dc.subject.other | Master's thesis | - |
dc.title | Explaining word interactions using integrated directional gradients | ca |
dc.type | info:eu-repo/semantics/masterThesis | ca |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca |
Appears in Collections: | Màster Oficial - Fonaments de la Ciència de Dades; Programari - Treballs de l'alumnat |
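The abstract compares IDG's word-level attributions against Integrated Gradients (IG), one of the baseline explainability methods. As a minimal illustration of how IG works, the sketch below approximates the IG path integral with a midpoint Riemann sum on a toy logistic model; the model, weights, and inputs are hypothetical stand-ins (in the thesis setting, the inputs would be BERT token embeddings and the model a hate speech classifier), not the thesis's actual implementation.

```python
import numpy as np

def model(x, w):
    """Scalar score: sigmoid of a linear function of the inputs (toy classifier)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def grad_model(x, w):
    """Analytic gradient of the model output with respect to the inputs x."""
    s = model(x, w)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight-line
    path from the baseline to the input."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_model(baseline + a * (x - baseline), w)
    return (x - baseline) * total / steps

# Hypothetical weights and input; the baseline is the all-zeros vector,
# a common choice for embedding-space attributions.
w = np.array([1.5, -2.0, 0.5])
x = np.array([1.0, 0.5, 1.0])
baseline = np.zeros_like(x)

attr = integrated_gradients(x, baseline, w)
# Completeness axiom: the attributions sum to f(x) - f(baseline).
print(np.allclose(attr.sum(), model(x, w) - model(baseline, w), atol=1e-4))  # → True
```

The completeness check at the end is the standard sanity test for IG: the per-feature attributions must account exactly for the change in model output between the baseline and the input.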
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
IDG_HateXplain-main.zip | Source code | 1.75 GB | ZIP |
TFM_Ballestero_Ribó_Marc.pdf | Thesis report | 8.08 MB | Adobe PDF |
This item is licensed under a Creative Commons License.