Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

📅 2024-12-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing adult content detection tools lack language-specific adaptation for morphologically rich languages such as Polish, which hinders accurate identification of nuanced, context-dependent erotic discourse. Method: We introduce forePLay, the first multidimensional annotated dataset for Polish erotic language (24,000+ sentences), systematically labeled across three fine-grained dimensions: ambiguity, violence, and social unacceptability. We establish the first Polish-specific annotation framework and rigorously evaluate monolingual Transformer models (e.g., Polish RoBERTa), demonstrating their superiority over multilingual baselines. To address class imbalance and ensure annotation reliability, we apply resampling techniques and multi-annotator consistency validation. Contribution/Results: Our approach yields a 12.7% improvement in macro-F1 score. The forePLay dataset is publicly released, serving as a benchmark resource and methodological blueprint for content safety research in Slavic languages.
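Macro-F1 is a common choice for imbalanced label sets because it averages per-class F1 scores with equal weight, so a rare class counts as much as a frequent one. The sketch below illustrates that property only; it is not the paper's evaluation code. The label names (`neutral`, `violent`) and the toy predictions are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 majority-class sentences, 2 minority-class sentences.
y_true = ["neutral"] * 8 + ["violent"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case accuracy looks reasonable while macro-F1 is much lower, because the minority class contributes an F1 of zero to the average.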

📝 Abstract

The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.

Problem

Research questions and friction points this paper is trying to address.

Polish Language

Adult Content Detection

Complex Grammar Handling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Polish-specific annotated sentence dataset

Imbalanced classification improvement

Multilingual content detection foundation
