Please use this identifier to cite or link to this item: https://hdl.handle.net/2445/223848
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | López Sánchez, Maite | -
dc.contributor.author | Li Chen, Chengheng | -
dc.date.accessioned | 2025-10-23T10:24:28Z | -
dc.date.available | 2025-10-23T10:24:28Z | -
dc.date.issued | 2025-06-10 | -
dc.identifier.uri | https://hdl.handle.net/2445/223848 | -
dc.description | Treballs Finals de Grau d'Enginyeria Informàtica, Facultat de Matemàtiques, Universitat de Barcelona, Any: 2025, Director: Maite López Sánchez | ca
dc.description.abstract | Large language models have evolved beyond simple text generation to serve as sophisticated decision-making aids and moral advisors across diverse domains. However, these systems exhibit systematic biases that may compromise their reliability when confronted with complex reasoning tasks, particularly in ethically nuanced scenarios where consistent judgment is important. Despite significant advances in alignment methodologies, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), current approaches predominantly focus on preventing overtly harmful outputs while potentially neglecting deeper structural inconsistencies in reasoning processes that can manifest when models encounter contextually biased inputs. This research explores AI alignment by investigating whether established cognitive debiasing techniques from psychology can be systematically adapted and integrated into machine learning training protocols. We introduce the COPO (Consider the Opposite, Perspective-taking, and Open-minded thinking) module, which operationalizes three empirically validated psychological debiasing interventions as computational training methodologies. This approach represents a possible shift from reactive harm mitigation toward proactive development of reasoning capabilities that may demonstrate more principled consistency across diverse contexts. Our methodology combines two complementary investigative approaches: external structured prompting interventions and embedded training pipeline integration. Using 2,491 real-world ethical scenarios, we employ three evaluation metrics (Political Disagreement Index, Symmetric Consensus Change, and Overall Intervention Effectiveness) to measure bias reduction with statistical rigor. Structured prompting experiments demonstrate promising bias mitigation, achieving an 18.1% reduction in cross-perspective disagreement patterns alongside a favorable 2.6:1 improvement-to-deterioration ratio. The training integration implements a three-phase RL-SFT-RL pipeline encompassing baseline Group Relative Policy Optimization (GRPO), COPO-informed supervised fine-tuning, and transfer assessment through resumed reinforcement learning. This methodology employs multi-component reward architectures evaluating verdict accuracy, structural compliance, and six-dimensional reasoning quality through strong-to-weak supervision. The integrated training achieves a 21.9% improvement in ethical reasoning quality, with the model earning higher rewards after COPO supervised fine-tuning, retaining these gains through subsequent autonomous learning phases, and showing evidence of knowledge transfer to previously unseen scenarios. Empirical results suggest that psychology-informed interventions can enhance analytical sophistication while reducing contextual bias susceptibility. The enhanced model demonstrates improved stakeholder consideration, systematic evidence integration, and more consistent moral judgment across varied framings without compromising decision accuracy. This work provides evidence that systematically embedding cognitive debiasing techniques into training protocols may enable AI systems to engage in more balanced reasoning, contributing to methodological foundations for psychology-informed AI alignment approaches. | en
dc.format.extent | 116 p. | -
dc.format.mimetype | application/pdf | -
dc.language.iso | eng | ca
dc.rights | memòria: cc-nc-nd (c) Chengheng Li Chen, 2025 | -
dc.rights | codi: GPL (c) Chengheng Li Chen, 2025 | -
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ | -
dc.rights.uri | http://www.gnu.org/licenses/gpl-3.0.ca.html | *
dc.source | Treballs Finals de Grau (TFG) - Enginyeria Informàtica | -
dc.subject.classification | Tractament del llenguatge natural (Informàtica) | ca
dc.subject.classification | Aprenentatge per reforç (Intel·ligència artificial) | ca
dc.subject.classification | Raonament qualitatiu | ca
dc.subject.classification | Ètica | ca
dc.subject.classification | Programari | ca
dc.subject.classification | Treballs de fi de grau | ca
dc.subject.other | Natural language processing (Computer science) | en
dc.subject.other | Reinforcement learning | en
dc.subject.other | Qualitative reasoning | en
dc.subject.other | Ethics | en
dc.subject.other | Computer software | en
dc.subject.other | Bachelor's theses | en
dc.title | Ethical reasoning in Large Language Models | ca
dc.type | info:eu-repo/semantics/bachelorThesis | ca
dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca
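
As illustrative context for the baseline Group Relative Policy Optimization (GRPO) step named in the abstract above, the following is a minimal sketch of the group-relative advantage computation that GRPO builds on: each sampled completion's reward is normalized against the mean and standard deviation of its group, so no learned value network is required. This is a generic Python sketch under that assumption, not code taken from the thesis or from Code.zip; the function and variable names are hypothetical.

import statistics

def group_relative_advantages(rewards):
    # Normalize each completion's reward against its group's statistics,
    # the core idea behind GRPO's critic-free advantage estimate.
    # Illustrative sketch only; names are hypothetical, not from Code.zip.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: scores for one group of completions rated by a multi-component
# reward (e.g. verdict accuracy, structural compliance, reasoning quality).
print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))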
Appears in Collections:
Treballs Finals de Grau (TFG) - Enginyeria Informàtica
Treballs Finals de Grau (TFG) - Matemàtiques
Programari - Treballs de l'alumnat

Files in This Item:
File | Description | Size | Format
TFG_Chengheng_Li_Chen.pdf | Memòria | 4.98 MB | Adobe PDF
Code.zip | Codi font | 37.91 MB | zip


This item is licensed under a Creative Commons License.