Please use this identifier to cite or link to this item: https://hdl.handle.net/2445/223848

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | López Sánchez, Maite | - |
| dc.contributor.author | Li Chen, Chengheng | - |
| dc.date.accessioned | 2025-10-23T10:24:28Z | - |
| dc.date.available | 2025-10-23T10:24:28Z | - |
| dc.date.issued | 2025-06-10 | - |
| dc.identifier.uri | https://hdl.handle.net/2445/223848 | - |
| dc.description | Treballs Finals de Grau d'Enginyeria Informàtica (Bachelor's thesis in Computer Engineering), Facultat de Matemàtiques, Universitat de Barcelona, Year: 2025, Advisor: Maite López Sánchez | ca |
| dc.description.abstract | Large language models have evolved beyond simple text generation to serve as sophisticated decision-making aids and moral advisors across diverse domains. However, these systems exhibit systematic biases that may compromise their reliability when confronted with complex reasoning tasks, particularly in ethically nuanced scenarios where consistent judgment is important. Despite significant advances in alignment methodologies, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), current approaches predominantly focus on preventing overtly harmful outputs while potentially neglecting deeper structural inconsistencies in reasoning processes that can manifest when models encounter contextually biased inputs. This research explores AI alignment by investigating whether established cognitive debiasing techniques from psychology can be systematically adapted and integrated into machine learning training protocols. We introduce the COPO (Consider the Opposite, Perspective-taking, and Open-minded thinking) module, which operationalizes three empirically validated psychological debiasing interventions into computational training methodologies. This approach represents a possible shift from reactive harm mitigation toward proactive development of reasoning capabilities that may demonstrate more principled consistency across diverse contexts. Our methodology combines two complementary investigative approaches: external structured prompting interventions and embedded training pipeline integration. Using 2,491 real-world ethical scenarios, we employ three evaluation metrics (Political Disagreement Index, Symmetric Consensus Change, and Overall Intervention Effectiveness) to measure bias reduction with statistical rigor. Structured prompting experiments demonstrate promising bias mitigation, achieving an 18.1% reduction in cross-perspective disagreement patterns alongside a favorable 2.6:1 improvement-to-deterioration ratio. The training integration implements a three-phase RL-SFT-RL pipeline encompassing baseline Group Relative Policy Optimization (GRPO), COPO-informed supervised fine-tuning, and transfer assessment through resumed reinforcement learning. This methodology employs multicomponent reward architectures evaluating verdict accuracy, structural compliance, and six-dimensional reasoning quality through strong-to-weak supervision. The integrated training achieves a 21.9% improvement in ethical reasoning quality, with the model gaining higher rewards after COPO supervised fine-tuning and showing persistence through autonomous learning phases, with evidence of knowledge transfer to previously unseen scenarios. Empirical results suggest that psychology-informed interventions can enhance analytical sophistication while reducing contextual bias susceptibility. The enhanced model demonstrates improved stakeholder consideration, systematic evidence integration, and more consistent moral judgment across varied framings without compromising decision accuracy. This work provides evidence that systematically embedding cognitive debiasing techniques into training protocols may enable AI systems to engage in more balanced reasoning, contributing to methodological foundations for psychology-informed AI alignment approaches. | en |
| dc.format.extent | 116 p. | - |
| dc.format.mimetype | application/pdf | - |
| dc.language.iso | eng | ca |
| dc.rights | memòria (thesis report): cc-nc-nd (c) Chengheng Li Chen, 2025 | - |
| dc.rights | codi (source code): GPL (c) Chengheng Li Chen, 2025 | - |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ | - |
| dc.rights.uri | http://www.gnu.org/licenses/gpl-3.0.ca.html | * |
| dc.source | Treballs Finals de Grau (TFG) - Enginyeria Informàtica | - |
| dc.subject.classification | Tractament del llenguatge natural (Informàtica) | ca |
| dc.subject.classification | Aprenentatge per reforç (Intel·ligència artificial) | ca |
| dc.subject.classification | Raonament qualitatiu | ca |
| dc.subject.classification | Ètica | ca |
| dc.subject.classification | Programari | ca |
| dc.subject.classification | Treballs de fi de grau | ca |
| dc.subject.other | Natural language processing (Computer science) | en |
| dc.subject.other | Reinforcement learning | en |
| dc.subject.other | Qualitative reasoning | en |
| dc.subject.other | Ethics | en |
| dc.subject.other | Computer software | en |
| dc.subject.other | Bachelor's theses | en |
| dc.title | Ethical reasoning in Large Language Models | ca |
| dc.type | info:eu-repo/semantics/bachelorThesis | ca |
| dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca |
| Appears in Collections: | Treballs Finals de Grau (TFG) - Enginyeria Informàtica; Treballs Finals de Grau (TFG) - Matemàtiques; Programari - Treballs de l'alumnat |
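The abstract names three evaluation metrics (Political Disagreement Index, Symmetric Consensus Change, and Overall Intervention Effectiveness) without stating their formulas in this record. The Python sketch below illustrates one plausible formulation of the first: the Political Disagreement Index as the fraction of scenarios whose verdict flips between two opposing political framings of the same prompt, and intervention effectiveness as the relative reduction of that index after a COPO-style prompting intervention. The function names, field names, and formulas are illustrative assumptions for this record, not the thesis's actual definitions.

```python
"""Hedged sketch of a framing-disagreement metric in the spirit of the
Political Disagreement Index named in the abstract. All names and the exact
formulas are assumptions made for illustration."""

from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str
    verdict_frame_a: str  # model verdict under one political framing
    verdict_frame_b: str  # model verdict under the opposing framing


def political_disagreement_index(results: list[ScenarioResult]) -> float:
    """Fraction of scenarios whose verdict changes across opposing framings.

    0.0 means perfectly framing-invariant judgments; 1.0 means every
    scenario's verdict flips with the political framing.
    """
    if not results:
        return 0.0
    flips = sum(r.verdict_frame_a != r.verdict_frame_b for r in results)
    return flips / len(results)


def intervention_effectiveness(pdi_baseline: float, pdi_intervention: float) -> float:
    """Relative reduction in disagreement after a COPO-style intervention.

    Under this assumed formulation, a value of 0.181 would correspond to the
    18.1% reduction in cross-perspective disagreement reported in the abstract.
    """
    if pdi_baseline == 0.0:
        return 0.0
    return (pdi_baseline - pdi_intervention) / pdi_baseline


if __name__ == "__main__":
    # Toy data: verdicts for four scenarios, before and after the intervention.
    baseline = [
        ScenarioResult("s1", "permissible", "impermissible"),
        ScenarioResult("s2", "permissible", "permissible"),
        ScenarioResult("s3", "impermissible", "impermissible"),
        ScenarioResult("s4", "impermissible", "permissible"),
    ]
    after_copo = [
        ScenarioResult("s1", "permissible", "permissible"),
        ScenarioResult("s2", "permissible", "permissible"),
        ScenarioResult("s3", "impermissible", "impermissible"),
        ScenarioResult("s4", "impermissible", "permissible"),
    ]
    pdi_0 = political_disagreement_index(baseline)    # 0.5
    pdi_1 = political_disagreement_index(after_copo)  # 0.25
    print(pdi_0, pdi_1, intervention_effectiveness(pdi_0, pdi_1))  # 0.5 0.25 0.5
```

In this assumed setup, each of the 2,491 ethical scenarios would be posed under two opposing framings, and the index is computed once before and once after the structured prompting intervention; the other two metrics would require the thesis itself for their definitions.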
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| TFG_Chengheng_Li_Chen.pdf | Memòria (thesis report) | 4.98 MB | Adobe PDF |
| Code.zip | Codi font (source code) | 37.91 MB | zip |
This item is licensed under a Creative Commons License.