🤖 AI Summary
This work addresses the "axis collapse" problem in multi-objective alignment, where conflicting objectives fragment the feature space, leading to catastrophic forgetting and inaccurate expert routing. To mitigate this, the authors propose AlignX, a two-stage framework: the first stage employs prompt-injected fine-tuning to extract axis-specific features, thereby alleviating forgetting; the second introduces the MoCaE module, grounded in fractal and natural geometric principles, to calibrate expert routing and enhance inference reliability. AlignX is the first method to formally define and resolve axis collapse, achieving a 171.5% increase in win rate on Alpaca, a 110.1% improvement in truthfulness on TruthfulQA, and a 4.3% reduction in safety violations, while simultaneously cutting latency and memory overhead by over 35%. The approach demonstrates strong generalization across four large language models.
📄 Abstract
Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, both face challenges in multi-objective settings: SFT causes interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
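To make the routing-calibration idea concrete, here is a minimal sketch of one generic way to calibrate an MoE gate: temperature-scaling the gating logits before expert selection, so an overconfident router is softened. This is an illustrative stand-in, not the paper's MoCaE module (whose fractal-geometry calibration is not specified here); the function name `calibrated_route` and the temperature mechanism are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over gating logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def calibrated_route(logits, temperature=1.0, top_k=1):
    """Generic calibrated MoE gating sketch (hypothetical, not MoCaE):
    divide gating logits by a temperature > 1 to soften overconfident
    routing distributions, then pick the top-k experts."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    top = np.argsort(probs)[::-1][:top_k]  # indices of chosen experts
    return top, probs

# A higher temperature keeps the same top expert but spreads
# probability mass, reducing the risk of hard misrouting.
experts, probs = calibrated_route([2.0, 0.5, -1.0], temperature=2.0)
```

In this sketch, calibration only reshapes the routing distribution; the expert ranking is preserved, but downstream weighted combinations of experts become less brittle to a miscalibrated gate.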