🤖 AI Summary
Existing adult content detection tools lack language-specific adaptation for morphologically rich languages like Polish, which hinders accurate identification of nuanced, context-dependent pornographic discourse. Method: We introduce forePLay, the first multidimensionally annotated dataset for Polish pornographic language (24,000+ sentences), systematically labeled across three fine-grained dimensions: ambiguity, violence, and social unacceptability. We establish the first Polish-specific annotation framework and evaluate monolingual Transformer models (e.g., Polish RoBERTa), showing that they outperform multilingual baselines. To address class imbalance and ensure annotation reliability, we apply resampling techniques and multi-annotator consistency validation. Contribution/Results: Our approach yields a 12.7% improvement in macro-F1 score. The forePLay dataset is publicly released as a benchmark resource and methodological blueprint for content safety research in Slavic languages.
📝 Abstract
The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy covering three dimensions: ambiguity, violence, and social unacceptability. Our comprehensive evaluation demonstrates that specialized Polish language models outperform multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.
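To illustrate why imbalanced categories matter for evaluation, here is a minimal sketch in plain Python that compares accuracy with macro-F1 on a toy label set. The label names (`neutral`, `violent`) and the data are hypothetical and are not taken from forePLay; the functions are a generic textbook formulation of the metric, not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 9 majority-class sentences and 1 rare-class sentence.
y_true = ["neutral"] * 9 + ["violent"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("macro-F1:", macro_f1(y_true, y_pred))
```

In this toy case the always-majority classifier looks strong on accuracy but scores poorly on macro-F1, because the rare class contributes an F1 of zero with the same weight as the frequent class.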