Please use this identifier to cite or link to this item:
https://hdl.handle.net/2445/223848
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | López Sánchez, Maite | - |
dc.contributor.author | Li Chen, Chengheng | - |
dc.date.accessioned | 2025-10-23T10:24:28Z | - |
dc.date.available | 2025-10-23T10:24:28Z | - |
dc.date.issued | 2025-06-10 | - |
dc.identifier.uri | https://hdl.handle.net/2445/223848 | - |
dc.description | Bachelor's thesis in Computer Engineering (Treballs Finals de Grau d'Enginyeria Informàtica), Facultat de Matemàtiques, Universitat de Barcelona, Year: 2025, Advisor: Maite López Sánchez | ca |
dc.description.abstract | Large language models have evolved beyond simple text generation to serve as sophisticated decision-making aids and moral advisors across diverse domains. However, these systems exhibit systematic biases that may compromise their reliability when confronted with complex reasoning tasks, particularly in ethically nuanced scenarios where consistent judgment is important. Despite significant advances in alignment methodologies, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), current approaches predominantly focus on preventing overtly harmful outputs while potentially neglecting deeper structural inconsistencies in reasoning processes that can manifest when models encounter contextually biased inputs. This research explores AI alignment by investigating whether established cognitive debiasing techniques from psychology can be systematically adapted and integrated into machine learning training protocols. We introduce the COPO (Consider the Opposite, Perspective-taking, and Open-minded thinking) module, which operationalizes three empirically validated psychological debiasing interventions into computational training methodologies. This approach represents a possible shift from reactive harm mitigation toward proactive development of reasoning capabilities that may demonstrate more principled consistency across diverse contexts. Our methodology combines two complementary investigative approaches: external structured prompting interventions and embedded training pipeline integration. Using 2,491 real-world ethical scenarios, we employ three evaluation metrics (Political Disagreement Index, Symmetric Consensus Change, and Overall Intervention Effectiveness) to measure bias reduction with statistical rigor. Structured prompting experiments demonstrate promising bias mitigation, achieving 18.1% reduction in cross-perspective disagreement patterns alongside a favorable 2.6:1 improvement-to-deterioration ratio. The training integration implements a three-phase RL-SFT-RL pipeline encompassing baseline Group Relative Policy Optimization (GRPO), COPO-informed supervised fine-tuning, and transfer assessment through resumed reinforcement learning. This methodology employs multicomponent reward architectures evaluating verdict accuracy, structural compliance, and six-dimensional reasoning quality through strong-to-weak supervision. The integrated training achieves 21.9% improvement in ethical reasoning quality, with the model gaining higher rewards after COPO supervised fine-tuning and showing persistence through autonomous learning phases with evidence of knowledge transfer to previously unseen scenarios. Empirical results suggest that psychology-informed interventions can enhance analytical sophistication while reducing contextual bias susceptibility. The enhanced model demonstrates improved stakeholder consideration, systematic evidence integration, and more consistent moral judgment across varied framings without compromising decision accuracy. This work provides evidence that systematically embedding cognitive debiasing techniques into training protocols may enable AI systems to engage in more balanced reasoning, contributing to methodological foundations for psychology-informed AI alignment approaches. | en |
dc.format.extent | 116 p. | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | eng | ca |
dc.rights | thesis report: cc-nc-nd (c) Chengheng Li Chen, 2025 | - |
dc.rights | code: GPL (c) Chengheng Li Chen, 2025 | - |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ | - |
dc.rights.uri | http://www.gnu.org/licenses/gpl-3.0.ca.html | * |
dc.source | Treballs Finals de Grau (TFG) - Enginyeria Informàtica | - |
dc.subject.classification | Tractament del llenguatge natural (Informàtica) | ca |
dc.subject.classification | Aprenentatge per reforç (Intel·ligència artificial) | ca |
dc.subject.classification | Raonament qualitatiu | ca |
dc.subject.classification | Ètica | ca |
dc.subject.classification | Programari | ca |
dc.subject.classification | Treballs de fi de grau | ca |
dc.subject.other | Natural language processing (Computer science) | en |
dc.subject.other | Reinforcement learning | en |
dc.subject.other | Qualitative reasoning | en |
dc.subject.other | Ethics | en |
dc.subject.other | Computer software | en |
dc.subject.other | Bachelor's theses | en |
dc.title | Ethical reasoning in Large Language Models | ca |
dc.type | info:eu-repo/semantics/bachelorThesis | ca |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca |
Appears in Collections: | Treballs Finals de Grau (TFG) - Enginyeria Informàtica; Treballs Finals de Grau (TFG) - Matemàtiques; Programari - Treballs de l'alumnat |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
TFG_Chengheng_Li_Chen.pdf | Thesis report | 4.98 MB | Adobe PDF |
Code.zip | Source code | 37.91 MB | zip |
This item is licensed under a Creative Commons License