Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

📅 2024-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing adult content detection tools lack language-specific adaptation for morphologically rich languages like Polish, hindering accurate identification of nuanced, context-dependent erotic discourse. Method: We introduce forePLay, the first multidimensional annotated dataset for Polish erotic language (24,000+ sentences), systematically labeled across three fine-grained dimensions: ambiguity, violence, and social unacceptability. We establish the first Polish-specific annotation framework and rigorously evaluate monolingual Transformer models (e.g., Polish RoBERTa), demonstrating their superiority over multilingual baselines. To address class imbalance and ensure annotation reliability, we apply resampling techniques and multi-annotator consistency validation. Contribution/Results: Our approach yields a 12.7% improvement in macro-F1 score. The forePLay dataset is publicly released, serving as a benchmark resource and methodological blueprint for content safety research in Slavic languages.

📝 Abstract
The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.
Problem

Research questions and friction points this paper is trying to address.

Polish Language
Adult Content Detection
Complex Grammar Handling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polish-specific erotic content dataset
Imbalanced classification improvement
Multilingual content detection foundation
Anna Kołos
NASK National Research Institute
Katarzyna Lorenc
NASK National Research Institute
Emilia Wiśnios
Independent Researcher
Agnieszka Karlińska
NASK National Research Institute