Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

📅 2026-06-09

🤖 AI Summary

This study investigates whether converting instruction-tuned LLMs into reasoning models, while improving multi-step reasoning, compromises the alignment behaviors of the original model, such as safe refusal, bias avoidance, and privacy protection. Through a systematic trustworthiness audit, the authors compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. The audit finds that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. In light of these findings, the study argues that trustworthiness metrics should be reported alongside reasoning gains when evaluating reasoning models.

📝 Abstract

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.
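The abstract does not say how the KL divergence is computed, so the following is only a minimal sketch of one plausible reading: compare the next-token distributions of the instruction-tuned baseline and the reasoning model on the same context, and average over token positions. The direction KL(baseline || reasoning), the use of raw logits as input, and the function names `log_softmax`, `kl_divergence`, and `mean_drift` are all assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits: Sequence[float],
                  reasoning_logits: Sequence[float]) -> float:
    """KL(P_base || P_reasoning) in nats for a single next-token distribution.

    Assumption: the baseline is the reference distribution P. The paper may
    use the opposite direction or a different granularity.
    """
    if len(base_logits) != len(reasoning_logits):
        raise ValueError("both models must share the same vocabulary size")
    log_p = log_softmax(base_logits)
    log_q = log_softmax(reasoning_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_drift(base_seq: Sequence[Sequence[float]],
               reasoning_seq: Sequence[Sequence[float]]) -> float:
    """Average per-position KL over a sequence of aligned logit vectors."""
    if len(base_seq) != len(reasoning_seq) or len(base_seq) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(kl_divergence(b, r) for b, r in zip(base_seq, reasoning_seq))
    return total / len(base_seq)


if __name__ == "__main__":
    # Toy two-token vocabulary: baseline is uniform, reasoning model is skewed.
    base = [[0.0, 0.0], [0.0, 0.0]]
    reasoning = [[math.log(3.0), 0.0], [0.0, 0.0]]
    print(f"mean drift (nats): {mean_drift(base, reasoning):.4f}")
```

In practice the logit vectors would come from running both models on identical prompts with a shared tokenizer. A drift of zero means the two models assign the same next-token probabilities, and larger values indicate the reasoning model has moved further from the baseline.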

Problem

Research questions and friction points this paper is trying to address.

reasoning models

alignment preservation

trustworthiness

instruction-tuned LLMs

behavioral drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning models

alignment preservation

trustworthiness evaluation