When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

📅 2026-02-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the "axis collapse" problem in multi-objective alignment, where conflicting objectives fragment the feature space, leading to catastrophic forgetting and inaccurate expert routing. To mitigate this, the authors propose AlignX, a two-stage framework: the first stage employs prompt-injected fine-tuning to extract axis-specific features, thereby alleviating forgetting; the second introduces the MoCaE module, grounded in fractal and natural geometric principles, to calibrate expert routing and enhance inference reliability. AlignX is the first method to formally define and resolve axis collapse, achieving a 171.5% increase in win rate on Alpaca, a 110.1% improvement in truthfulness on TruthfulQA, and a 4.3% reduction in safety violations, while simultaneously cutting latency and memory overhead by over 35%. The approach demonstrates strong generalization across four large language models.

Technology Category

Application Category

πŸ“ Abstract
Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) models to align LLMs. However, both face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty): a +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
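To make the two-stage idea concrete, here is a minimal, hypothetical sketch of its core mechanics: tagging prompts with an alignment axis (Stage 1) and temperature-calibrating a router's gating logits before expert selection (Stage 2). All names (`inject_axis_prompt`, `calibrated_route`, the axis tags, the temperature value) are illustrative assumptions, not the paper's actual implementation of prompt-injected fine-tuning or MoCaE.

```python
import math

# Assumed alignment axes, one per HHH objective.
AXES = ["helpful", "harmless", "honest"]

def inject_axis_prompt(prompt: str, axis: str) -> str:
    """Stage 1 (assumed form): prepend an axis tag so fine-tuning
    can learn axis-specific features instead of entangling them."""
    assert axis in AXES
    return f"[{axis}] {prompt}"

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def calibrated_route(gate_logits, temperature=2.0):
    """Stage 2 (assumed form): a MoCaE-style calibration step that
    softens over-confident gating logits (temperature > 1) before
    picking an expert, so near-tied axes are not misrouted with
    spurious confidence."""
    probs = softmax(gate_logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return AXES[best], probs

# Example: two axes with nearly tied logits.
tagged = inject_axis_prompt("Is this claim true?", "honest")
axis, probs = calibrated_route([4.0, 3.9, 0.1])
```

The design point illustrated is that calibration does not change which expert wins on clean inputs; it only shrinks the confidence gap, which matters when routing probabilities feed downstream weighting or thresholding.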
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Multi-objective Alignment
Helpfulness
Harmlessness
Honesty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Axis Collapse
AlignX
Prompt-injected Fine-tuning
MoCaE
Multi-objective Alignment