Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

📅 2026-05-29

🤖 AI Summary
This work addresses a gap in subjective NLP tasks such as hate speech detection: human annotators disagree substantially, both in the labels they assign and in the token-level rationales they give as explanations, and existing evaluation methods capture this variation poorly. The authors propose a unified evaluation framework that compares hard and soft label representations and hard, intermediate, and soft rationale representations under identical models, training strategies, and protocols. Classification performance is assessed along predictive and distributional dimensions, while explainability is evaluated through plausibility, faithfulness, and complexity. The results show that both hard and soft metrics favor softer representations, which suggests that soft representations better capture the variation in human judgments and that evaluation practice in subjective NLP needs rethinking.
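To make the hard versus soft label distinction concrete, here is a minimal sketch. It is not the authors' implementation: the three-way label set, the example votes, and the model probabilities are all hypothetical, and Jensen-Shannon divergence is used only as one plausible example of a distributional metric, since the summary does not name the metrics the paper uses.

```python
import math
from collections import Counter

# Hypothetical label set; the summary does not specify the paper's labels.
LABELS = ["hate", "offensive", "normal"]

def soft_label(votes, labels=LABELS):
    """Soft label: normalized annotator vote counts (a distribution)."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]

def hard_label(votes, labels=LABELS):
    """Hard label: majority vote; ties go to the label listed first."""
    dist = soft_label(votes, labels)
    return labels[dist.index(max(dist))]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

votes = ["hate", "hate", "offensive"]   # three hypothetical annotators
model_probs = [0.9, 0.05, 0.05]         # hypothetical model output

# Predictive view: does the argmax prediction match the majority label?
predicted = LABELS[model_probs.index(max(model_probs))]
is_correct = predicted == hard_label(votes)

# Distributional view: how far is the model from the annotator distribution?
jsd = js_divergence(soft_label(votes), model_probs)
```

In this example the prediction counts as correct under the hard label, while the divergence is still nonzero because the model puts almost no mass on the minority vote. That gap is the kind of information a purely predictive metric discards.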
📝 Abstract
Human disagreement in labeling is ubiquitous and well known. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how best to evaluate human labels and rationales, or even how best to aggregate rationales beyond majority vote, in light of this variation. Yet rationales may provide additional insight into the richness of human reasoning, which may differ in style, values, and interpretations, especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties, predictive and distributional, while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label representation space (hard and soft) and rationale representation space (hard, intermediate, and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
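As an illustration of aggregating token-level rationales, the sketch below builds a hard rationale by per-token majority vote and a soft rationale as the per-token fraction of annotators who highlighted each token. These definitions, the example masks, and the use of token-level F1 as a plausibility proxy are assumptions made for illustration. The abstract does not define the intermediate representation, so it is left out, and faithfulness and complexity are not shown.

```python
def soft_rationale(masks):
    """Per-token fraction of annotators who highlighted the token."""
    n = len(masks)
    return [sum(column) / n for column in zip(*masks)]

def hard_rationale(masks):
    """Per-token majority vote: 1 if more than half highlighted it."""
    return [1 if score > 0.5 else 0 for score in soft_rationale(masks)]

def token_f1(pred, gold):
    """Token-level F1 between two binary masks (assumed plausibility proxy)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)

# Three hypothetical annotators marking a four-token post (1 = highlighted).
masks = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]

soft = soft_rationale(masks)   # keeps the 2/3 and 1/3 partial agreement
hard = hard_rationale(masks)   # collapses it to a binary mask

model_mask = [0, 1, 0, 0]      # hypothetical model rationale
f1 = token_f1(model_mask, hard)
```

The hard mask treats the third token as fully highlighted and the fourth as not highlighted at all, whereas the soft rationale records that annotators were split on both. Evaluating against the soft version is one way to keep that disagreement visible.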
Problem

Research questions and friction points this paper is trying to address.

hate speech detection
human disagreement
rationales
explainability evaluation
subjective NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

rationale representation
explainability evaluation
subjective NLP
soft labels
unified evaluation framework