🤖 AI Summary
This work addresses the longstanding challenge of jointly optimizing fairness and performance in language models. We first uncover an intrinsic connection between neural collapse—a geometric phenomenon wherein class-wise representations converge to simplex equiangular tight frames—and model bias: debiasing induces geometric alignment between token representations and word embeddings. Leveraging this insight, we propose a general, geometry-aware fine-tuning framework—the first interpretable and generalizable paradigm for fairness enhancement. Our method integrates neural collapse analysis, representation alignment modeling, and constrained fine-tuning, validated across a comprehensive multi-task fairness evaluation suite. Experiments demonstrate that our approach improves fairness by 23.6% on average across eight mainstream bias benchmarks, while preserving natural language understanding performance with negligible fluctuations (±0.3%).
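The simplex equiangular tight frame (ETF) mentioned above has a concrete geometry: for K classes, the class-mean directions become K unit vectors with identical pairwise cosine similarity of -1/(K-1). As a minimal numerical sketch (the function name and construction are illustrative, not from the paper's code), one can build such a frame and verify its Gram matrix:

```python
import numpy as np

def simplex_etf(K, d):
    """Build a K-vector simplex ETF embedded in R^d (requires d >= K).

    Construction: center the K standard basis vectors, then normalize.
    The resulting unit vectors have pairwise cosine exactly -1/(K-1).
    """
    M = np.eye(K) - np.ones((K, K)) / K                  # centered identity
    M = M / np.linalg.norm(M, axis=1, keepdims=True)     # unit rows
    out = np.zeros((K, d))                               # zero-pad into R^d
    out[:, :K] = M
    return out

K, d = 4, 8
E = simplex_etf(K, d)
G = E @ E.T  # Gram matrix: 1.0 on the diagonal, -1/(K-1) off-diagonal
```

Under neural collapse, last-layer class means (after centering) converge to such a frame; the paper's observation is that debiasing drives an analogous alignment between token representations and word embeddings, which can then be monitored or enforced during fine-tuning.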
📝 Abstract
To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite these developments, it remains nontrivial to reach a principled understanding of fairness and to design an algorithm that consistently debiases language models. In this work, through rigorous evaluations of Neural Collapse -- a learning phenomenon that occurs in the last-layer representations and classifiers of deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that effectively improves fairness across a wide range of debiasing methods, while preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://github.com/Xujxyang/Fairness-NC-main.