🤖 AI Summary
This work addresses the detection of implicit hate speech, which lacks explicit keywords and depends heavily on sociocultural context, making it difficult for existing detection methods to identify. The authors propose a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents that represent specific demographic groups. A community-driven consultative mechanism explicitly integrates sociocultural background knowledge from publicly available sources to enable identity-aware hate speech detection. Built on large language model prompting and external knowledge integration, and evaluated with balanced accuracy as the central fairness metric, the approach is reported to outperform state-of-the-art prompting strategies (zero-shot, few-shot, and chain-of-thought) on the ToxiGen dataset. According to the authors, the method improves both classification accuracy and fairness across all target groups.
📝 Abstract
This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by using balanced accuracy as a central metric of classification fairness, one that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
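To make the evaluation metric concrete, the following is a minimal sketch of balanced accuracy, computed as the mean of the true positive rate and the true negative rate, both overall and per target group. It is an illustration only, not the authors' evaluation code: the function names, the binary label convention (1 = hateful, 0 = benign), the toy data, and the choice to score an absent class as 0.0 are all assumptions made for this example.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate and true negative rate.

    Assumes binary labels: 1 = hateful, 0 = benign (an assumption of this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # If a class is absent, its rate is undefined; this sketch scores it as 0.0.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


def balanced_accuracy_by_group(y_true, y_pred, groups):
    """Balanced accuracy computed separately for each target group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: balanced_accuracy(ts, ps) for g, (ts, ps) in buckets.items()}


# Toy, imbalanced example (hypothetical data): 2 hateful and 4 benign items.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 5 of 6 correct
print(balanced_accuracy(y_true, y_pred))  # 0.75 (TPR = 0.5, TNR = 1.0)
```

In the toy example, plain accuracy is inflated by the benign majority, while balanced accuracy exposes that half of the hateful items were missed. Reporting the per-group values alongside the overall score is one way to check whether performance is even across target groups.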