🤖 AI Summary
This study addresses algorithmic bias in hate speech detection arising from annotator–target group identity mismatches. We propose a fairness optimization method grounded in persona modeling, integrating social-psychological group identity theory into NLP. Specifically, we construct personalized large language models (Persona-LLMs) that explicitly incorporate annotators' sociodemographic attributes (e.g., gender, race, religion) via shallow persona prompting and RAG-enhanced deep contextualized persona modeling. Experiments on Gemini and GPT-4.1-mini demonstrate significant improvements in cross-group fairness, particularly reduced false positives on minority-group texts, establishing both the efficacy and the practical limits of persona-based modeling for mitigating identity-related bias. Our core contribution is a novel, interpretable, and controllable identity-aware detection paradigm, offering both theoretical insights and a technical framework for fair NLP.
📝 Abstract
In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google's Gemini and OpenAI's GPT-4.1-mini models with two persona-prompting methods: shallow persona prompting and deeply contextualised persona development based on Retrieval-Augmented Generation (RAG), which incorporates richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models' detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can mitigate bias in automated hate speech detection. Our results highlight both the potential and the limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.