Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates large language models (LLMs) on explicit and implicit hate speech detection, focusing on how safety alignment affects classification objectivity, fairness, and semantic understanding. We conduct multidimensional comparative experiments—assessing accuracy, robustness, cross-group fairness, confidence calibration, and irony detection—between strictly safety-aligned and unaligned models. Results show that aligned models achieve higher accuracy (78.7% vs. 64.1%) but exhibit pronounced ideological anchoring, inter-group fairness disparities, systematic overconfidence, and near-universal failure in irony recognition. Unaligned models, while more flexible, remain vulnerable to latent value-laden framing. Crucially, this work is the first to empirically expose a fundamental tension between safety alignment and epistemic objectivity. We propose the “alignment–bias–robustness” triadic trade-off framework, offering both theoretical grounding and empirical benchmarks for designing trustworthy content moderation systems.

📝 Abstract
We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining whether models with minimal safety alignment (uncensored) might provide more objective classification capabilities compared to their heavily-aligned (censored) counterparts. While uncensored models theoretically offer a less constrained perspective free from moral guardrails that could bias classification decisions, our results reveal a surprising trade-off: censored models significantly outperform their uncensored counterparts in both accuracy and robustness, achieving 78.7% versus 64.1% strict accuracy. However, this enhanced performance comes with its own limitation -- the safety alignment acts as a strong ideological anchor, making censored models resistant to persona-based influence, while uncensored models prove highly malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency.
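The abstract reports "systemic overconfidence that renders self-reported certainty unreliable." A standard way to audit this is a calibration metric such as expected calibration error (ECE), which bins predictions by stated confidence and measures the gap between confidence and empirical accuracy in each bin. The sketch below is a generic illustration of that check, not the paper's own evaluation code; the bin count and toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average, over confidence bins, of the absolute gap
    between mean self-reported confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; clamp 1.0 to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece

# Toy example of an overconfident classifier: near-certain self-reports,
# but only half the labels are actually correct.
stated = [0.95, 0.90, 0.92, 0.88]
hits = [1, 0, 0, 1]
print(expected_calibration_error(stated, hits))
```

A well-calibrated model has an ECE near zero; the overconfident toy data above yields a large gap, which is the failure mode the abstract attributes to the evaluated models.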
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety alignment impact on hate speech detection objectivity
Assessing ideological bias versus classification accuracy trade-offs in LLMs
Identifying model failures in irony understanding and fairness disparities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparing safety-aligned and uncensored models for hate speech detection
Evaluating ideological bias versus classification accuracy trade-offs
Assessing model robustness to persona-based influence and irony
Sanjeeevan Selvaganapathy
Network Analysis and Social Influence Modeling (NASIM) Lab, School of Physics, Maths and Computing, The University of Western Australia
Mehwish Nasim
Senior Lecturer, UWA, Australia; Network Analysis & Social Influence Modelling (NASIM) Lab
Information Warfare · Network Science · NLP · Complex Systems · PsyOp