🤖 AI Summary
Stereotypical biases (e.g., gender, race) and structural biases (e.g., lexical overlap, positional preference) in large language models often co-occur because both arise from spurious feature correlations in the input; existing mitigation methods address them in isolation, risking the transfer of one bias while suppressing the other. This paper proposes Causal-Contrastive Preference Optimization (C2PO), the first framework to unify both bias types under a shared causal origin: latent spurious feature correlations. C2PO introduces counterfactual interventions to disentangle bias-inducing pathways from valid reasoning, and applies a fairness-sensitive, logit-level preference update that dynamically attributes and suppresses shortcut features during optimization. Evaluated on bias benchmarks including BBQ and HANS, C2PO significantly reduces bias while preserving model capability, showing no performance degradation on competence benchmarks such as MMLU and GSM8K. The approach thus achieves joint improvement in both fairness and task performance.
📝 Abstract
Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or positional preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary cause: latent spurious feature correlations within the input that drive erroneous reasoning shortcuts. Motivated by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that tackles these failures by simultaneously discovering and suppressing such correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.
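The abstract does not spell out the preference-update objective. As a rough illustrative sketch only (not the paper's actual loss), a counterfactual preference pair — a debiased response preferred over one that exploits a spurious cue — could be scored with a DPO-style contrastive loss on summed log-probabilities; here the `fairness_weight` term and the pairing scheme are assumptions introduced for illustration:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def counterfactual_preference_loss(pi_chosen: float, pi_rejected: float,
                                   ref_chosen: float, ref_rejected: float,
                                   beta: float = 0.1,
                                   fairness_weight: float = 1.0) -> float:
    """DPO-style loss on a (debiased, shortcut-reliant) response pair.

    pi_* / ref_* are summed log-probs of each response under the policy
    and a frozen reference model. `fairness_weight` is a hypothetical
    knob that upweights pairs whose rejected response is driven by a
    spurious feature (e.g., lexical overlap).
    """
    # Implicit-reward margin: how much more the policy prefers the
    # debiased response, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -fairness_weight * math.log(sigmoid(beta * margin))


# A tied pair (zero margin) yields the maximum-uncertainty loss log(2);
# any positive margin toward the debiased response lowers the loss.
loss_tied = counterfactual_preference_loss(-10.0, -10.0, -10.0, -10.0)
loss_good = counterfactual_preference_loss(-5.0, -10.0, -10.0, -10.0)
```

Minimizing this loss pushes the policy's logits toward the debiased response while the reference term keeps the update anchored to the original model, which is one way a logit-level mechanism can suppress shortcut features without degrading general capability.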