Document type
Bachelor's thesis (Treball de fi de grau)
Publication date
Publication license
Please always use this identifier to cite or link this document: https://hdl.handle.net/2445/223848
Ethical reasoning in Large Language Models
Authors
Li Chen, Chengheng
Abstract
Large language models have evolved beyond simple text generation to serve as sophisticated decision-making aids and moral advisors across diverse domains. However, these systems exhibit systematic biases that may compromise their reliability when confronted with complex reasoning tasks, particularly in ethically nuanced scenarios where consistent judgment is important. Despite significant advances in alignment methodologies, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), current approaches predominantly focus on preventing overtly harmful outputs while potentially neglecting deeper structural inconsistencies in reasoning processes that can manifest when models encounter contextually biased inputs. This research explores AI alignment by investigating whether established cognitive debiasing techniques from psychology can be systematically adapted and integrated into machine learning training protocols. We introduce the COPO (Consider the Opposite, Perspective-taking, and Open-minded thinking) module, which operationalizes three empirically validated psychological debiasing interventions into computational training methodologies. This approach represents a possible shift from reactive harm mitigation toward proactive development of reasoning capabilities that may demonstrate more principled consistency across diverse contexts.
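As a minimal illustration of how the three COPO interventions can be operationalized as a structured prompting step, the sketch below wraps an ethical scenario with the three debiasing instructions. The template wording and function name are illustrative assumptions, not the thesis's exact prompts.

    # Illustrative sketch of a COPO-style structured prompting intervention.
    # Wording and names are assumptions, not the thesis's exact templates.
    COPO_STEPS = [
        "Consider the opposite: list the strongest arguments against your initial verdict.",
        "Perspective-taking: restate the dilemma from the viewpoint of each affected stakeholder.",
        "Open-minded thinking: weigh all evidence before committing to a final, framing-independent verdict.",
    ]

    def build_copo_prompt(scenario: str) -> str:
        """Wrap an ethical scenario with the three COPO debiasing instructions."""
        steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(COPO_STEPS))
        return (
            f"Scenario:\n{scenario}\n\n"
            f"Before giving a verdict, work through these debiasing steps:\n{steps}\n\n"
            "Final verdict (acceptable / unacceptable) with a brief justification:"
        )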
Our methodology combines two complementary investigative approaches: external structured prompting interventions and embedded training pipeline integration. Using 2,491 real-world ethical scenarios, we employ three evaluation metrics (Political Disagreement Index, Symmetric Consensus Change, and Overall Intervention Effectiveness) to measure bias reduction with statistical rigor. Structured prompting experiments demonstrate promising bias mitigation, achieving an 18.1% reduction in cross-perspective disagreement patterns alongside a favorable 2.6:1 improvement-to-deterioration ratio.
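The sketch below shows one plausible way to quantify cross-perspective disagreement: the fraction of framing pairs whose binary verdicts differ, averaged over scenarios. This is an assumed operationalization for illustration only, not the thesis's exact Political Disagreement Index.

    # Assumed disagreement measure over politically framed variants of each scenario.
    from itertools import combinations

    def disagreement_index(verdicts_by_framing: dict[str, list[int]]) -> float:
        """Fraction of framing pairs whose binary verdicts differ, averaged over scenarios."""
        framings = list(verdicts_by_framing)
        n_scenarios = len(verdicts_by_framing[framings[0]])
        disagreements, comparisons = 0, 0
        for i in range(n_scenarios):
            for a, b in combinations(framings, 2):
                comparisons += 1
                disagreements += int(verdicts_by_framing[a][i] != verdicts_by_framing[b][i])
        return disagreements / comparisons

    # Relative reduction after the intervention (e.g. the reported 18.1%):
    # reduction = (disagreement_index(baseline) - disagreement_index(copo)) / disagreement_index(baseline)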
The training integration implements a three-phase RL-SFT-RL pipeline encompassing baseline Group Relative Policy Optimization (GRPO), COPO-informed supervised fine-tuning, and transfer assessment through resumed reinforcement learning. This methodology employs a multicomponent reward architecture that evaluates verdict accuracy, structural compliance, and six-dimensional reasoning quality through strong-to-weak supervision. The integrated training achieves a 21.9% improvement in ethical reasoning quality: the model attains higher rewards after COPO-informed supervised fine-tuning, sustains these gains through the subsequent autonomous learning phase, and shows evidence of knowledge transfer to previously unseen scenarios.
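To make the multicomponent reward idea concrete, the sketch below combines verdict accuracy, structural compliance, and averaged reasoning-quality scores into a single scalar reward. Component names, weights, and the scoring interface are assumptions for illustration, not the thesis's exact reward architecture.

    # Assumed multicomponent reward: weighted sum of accuracy, format compliance,
    # and the mean of six reasoning-quality dimensions scored in [0, 1].
    def copo_reward(
        predicted_verdict: str,
        gold_verdict: str,
        follows_required_structure: bool,
        quality_scores: dict[str, float],
        weights: tuple[float, float, float] = (0.5, 0.2, 0.3),
    ) -> float:
        """Return a scalar reward for a single model response."""
        accuracy = float(predicted_verdict == gold_verdict)
        compliance = float(follows_required_structure)
        quality = sum(quality_scores.values()) / len(quality_scores)
        w_acc, w_fmt, w_qual = weights
        return w_acc * accuracy + w_fmt * compliance + w_qual * quality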
Empirical results suggest that psychology-informed interventions can enhance analytical sophistication while reducing contextual bias susceptibility. The enhanced model demonstrates improved stakeholder consideration, systematic evidence integration, and more consistent moral judgment across varied framings without compromising decision accuracy. This work provides evidence that systematically embedding cognitive debiasing techniques into training protocols may enable AI systems to engage in more balanced reasoning, contributing to methodological foundations for psychology-informed AI alignment approaches.
Description
Bachelor's theses in Computer Engineering (Treballs Finals de Grau d'Enginyeria Informàtica), Facultat de Matemàtiques, Universitat de Barcelona, Year: 2025, Advisor: Maite López Sánchez
Citation
LI CHEN, Chengheng. Ethical reasoning in Large Language Models. [accessed: 26 November 2025]. [Available at: https://hdl.handle.net/2445/223848]