🤖 AI Summary
This paper investigates the conceptual structure of racial bias in large language models (LLMs) and the feasibility of generalizable mitigation strategies. We employ interpretability-based pruning at the neuron and attention-head levels, coupled with causal attribution analysis across multi-domain bias benchmarks, including finance and business. Our study is the first to systematically show that racial bias has a dual representational structure: part of it is universal across contexts, while the rest is highly context-specific. Pruning effectively reduces bias on specific tasks without compromising model robustness, but its cross-task generalizability is severely limited: mitigation strategies tuned on financial benchmarks fail on business tasks at a rate of 73%. These findings indicate that bias mitigation cannot rely on domain-agnostic techniques alone; instead, context-specific accountability mechanisms, grounded in legal and operational responsibilities, are urgently needed to ensure equitable deployment.
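The paper's code is not reproduced here, but the core loop of ablation-based causal attribution can be sketched. The PyTorch toy below ranks hidden units by how much zeroing each one moves a bias metric (here, a hypothetical mean score gap between demographic-flipped input pairs). The stand-in model, dimensions, and the feature-flip convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in model: one hidden layer whose units we can ablate one by one.
d_in, d_hidden = 16, 32
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, 1))

# Hypothetical paired inputs: same scenario, demographic attribute flipped.
x_group_a = torch.randn(64, d_in)
x_group_b = x_group_a.clone()
x_group_b[:, 0] *= -1.0  # assume feature 0 encodes the flipped attribute

def bias_gap(m):
    """Mean score difference between the two groups (a toy bias metric)."""
    with torch.no_grad():
        return (m(x_group_a) - m(x_group_b)).mean().item()

def causal_effects(m):
    """Ablate each hidden unit in turn and record how the bias gap moves.
    A negative effect means ablating the unit shrank the gap."""
    baseline = bias_gap(m)
    w2 = m[2].weight  # output weights reading from the hidden layer
    effects = {}
    for u in range(d_hidden):
        saved = w2[:, u].clone()
        with torch.no_grad():
            w2[:, u] = 0.0          # zero the unit's downstream contribution
        effects[u] = bias_gap(m) - baseline
        with torch.no_grad():
            w2[:, u] = saved        # restore before testing the next unit
    return effects

# Units whose ablation changes the gap most are candidate bias units.
ranked = sorted(causal_effects(model).items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[:5])
```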
📝 Abstract
We employ model pruning to examine how LLMs conceptualize racial biases and whether a generalizable mitigation strategy for such biases is feasible. Our analysis yields several novel insights. We find that pruning can be an effective method for reducing bias without significantly increasing anomalous model behavior. Neuron-based pruning strategies generally yield better results than approaches that prune entire attention heads. However, our results also show that the effectiveness of either approach deteriorates quickly as pruning strategies become more generalized. For instance, a model pruned to remove racial biases in the context of financial decision-making generalizes poorly to biases in commercial transactions. Overall, our analysis suggests that racial biases are only partially represented as a general concept within language models; the remainder is highly context-specific, suggesting that generalizable mitigation strategies may be of limited effectiveness. Our findings have important implications for legal frameworks surrounding AI. In particular, they suggest that an effective mitigation strategy should include the allocation of legal responsibility to those who deploy models in a specific use case.
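To make the neuron-versus-head distinction concrete, here is a minimal PyTorch sketch of the two ablation granularities, assuming a standard transformer layout: an MLP up-projection whose rows correspond to hidden "neurons", and an attention output projection whose column blocks correspond to heads. The layer stand-ins, dimensions, and indices are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative transformer-block dimensions (GPT-2-like; not from the paper).
d_model, n_heads, d_ff = 768, 12, 3072
head_dim = d_model // n_heads

mlp_up = nn.Linear(d_model, d_ff)       # stand-in for an MLP up-projection
attn_out = nn.Linear(d_model, d_model)  # stand-in for the attention output projection

def prune_neurons(layer: nn.Linear, neuron_ids):
    """Neuron-level pruning: zero the rows (and biases) feeding selected
    hidden units, so each unit's pre-activation is 0 and, with a
    zero-preserving activation like GELU, its output contributes nothing."""
    with torch.no_grad():
        layer.weight[neuron_ids, :] = 0.0
        if layer.bias is not None:
            layer.bias[neuron_ids] = 0.0

def prune_heads(layer: nn.Linear, head_ids):
    """Head-level pruning: zero the output-projection columns belonging
    to whole attention heads, removing each head's contribution entirely."""
    with torch.no_grad():
        for h in head_ids:
            layer.weight[:, h * head_dim:(h + 1) * head_dim] = 0.0

# Example: ablate two hypothetical bias-attributed neurons and one head.
prune_neurons(mlp_up, neuron_ids=[17, 342])
prune_heads(attn_out, head_ids=[5])
```

The two granularities trade off precision against scope: neuron-level masks can surgically target individual bias-attributed units, while head-level masks remove a much larger functional block at once, which is consistent with the abstract's finding that neuron-based strategies generally perform better.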