Explaining word interactions using integrated directional gradients

Explainability methods are key to understanding the decision-making processes of complex text models. In this thesis, we theoretically and empirically explore Integrated Directional Gradients (IDG), a method that attributes importance to both individual features and their high-order interactions in deep neural network (DNN) models. We introduce evaluation metrics to quantitatively assess the quality of the generated explanations, and propose a framework to adapt word-level evaluation methods to high-order, phrase-level interactions. Applying IDG to a BERT-based hate speech detection model, we compare its word-level performance against well-established methods such as Integrated Gradients (IG) and Shapley Additive Explanations (SHAP). Our results indicate that, while IDG’s word-level attributions are less faithful than those of IG and SHAP, they score highest on plausibility. In contrast, IDG’s high-order importance attributions score highly on faithfulness metrics, indicating that IDG can capture hierarchical dependencies that traditional methods overlook. Qualitative analyses further support the interpretability of IDG explanations. Overall, this thesis highlights the potential of high-order explanation methods for improving transparency in text models.

Description

Master's final project, Master's in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2025. Advisors: Daniel Ortiz Martínez and Petia Radeva

Subjects

Neural networks (Computer science), Natural language processing (Computer science), Hate speech, Master's thesis

Collections

Official Master's - Fundamentals of Data Science
Software - Student works

Citation

BALLESTERO RIBÓ, Marc. Explaining word interactions using integrated directional gradients. [accessed: 25 February 2026]. [Available at: https://hdl.handle.net/2445/223054]
