🤖 AI Summary
This work addresses the challenge of detecting context-dependent toxic language, such as implicit sexism, harassment, and abuse, in conversational settings. It presents SafeSpeech, the first multi-granularity toxicity analysis framework designed specifically for dialogue. Methodologically, the framework integrates message-level classification with dialogue-level modeling, incorporating toxicity-aware conversation summarization and persona profiling, and adds a perplexity-gain mechanism to enhance interpretability. Extensive experiments on established benchmarks, including EDOS, OffensEval, and HatEval, demonstrate state-of-the-art performance on fine-grained sexism detection, while the platform further enables cross-message toxicity tracking, context-aware summarization, and behavioral persona characterization. The framework thus pairs strong discriminative capability with principled interpretability.
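To make the perplexity-gain mechanism concrete, the sketch below scores each word in a message by how much its removal shifts a causal language model's perplexity. This is a minimal reading of the idea, assuming GPT-2 as the scoring model; the paper's actual scoring model, granularity, and aggregation may differ, and the `perplexity_gain` helper and example message are illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    # Mean token-level negative log-likelihood under the LM, exponentiated.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def perplexity_gain(message: str) -> list[tuple[str, float]]:
    # Leave-one-word-out attribution: a large absolute gain suggests the
    # word carries much of the message's (possibly toxic) signal.
    words = message.split()
    base = perplexity(message)
    return [
        (words[i], perplexity(" ".join(words[:i] + words[i + 1:])) - base)
        for i in range(len(words))
    ]

for word, gain in perplexity_gain("you are such a pathetic idiot"):
    print(f"{word:>10s}  {gain:+7.2f}")
```

Attributing predictions to individual words this way needs no gradient access to the classifier, which suits a platform that mixes fine-tuned models and LLMs.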
📝 Abstract
Detecting toxic language, including sexism, harassment, and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce SafeSpeech, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxicity-aware conversation summarization, and persona profiling. SafeSpeech also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, show that SafeSpeech reproduces state-of-the-art performance across multiple tasks, including fine-grained sexism detection.
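As an illustration of bridging message-level and conversation-level insights, the following sketch scores each turn of a dialogue with a message-level toxicity classifier and then aggregates the scores per speaker, a crude stand-in for persona profiling. It assumes a public off-the-shelf model (unitary/toxic-bert) with its own label scheme rather than the platform's fine-tuned classifiers; the conversation, speaker names, and aggregation rule are all invented for the example.

```python
from collections import defaultdict
from transformers import pipeline

# Off-the-shelf stand-in for SafeSpeech's fine-tuned message-level classifiers.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

conversation = [
    ("user_a", "Nobody asked for your opinion."),
    ("user_b", "Let's keep this discussion civil, please."),
    ("user_a", "Women like you should not even be in this forum."),
]

# Message level: score each turn in isolation.
turn_scores = []
for speaker, text in conversation:
    top = toxicity(text)[0]  # highest-probability label for this message
    score = top["score"] if top["label"] == "toxic" else 0.0
    turn_scores.append((speaker, score))

# Conversation level: aggregate per speaker as a crude persona signal.
per_speaker = defaultdict(list)
for speaker, score in turn_scores:
    per_speaker[speaker].append(score)

for speaker, scores in per_speaker.items():
    print(f"{speaker}: mean toxicity {sum(scores) / len(scores):.2f} "
          f"over {len(scores)} message(s)")
```

A per-speaker aggregate like this is what lets conversation-level analysis surface patterns, such as one participant repeatedly targeting another, that any single message-level prediction would miss.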