AI Summary
This work addresses two challenges in detecting Arabic text generated by large language models (LLMs): poor cross-genre generalization and the absence of high-quality benchmark datasets. To this end, we introduce ALHD, the first large-scale, multi-genre (news, social media, reviews), multi-variety (Modern Standard Arabic and dialectal Arabic) dataset for distinguishing human-written from machine-generated Arabic text. ALHD employs multi-source sampling, balanced class and genre distributions, and standardized train/validation/test splits. Systematic evaluation reveals substantial performance degradation in cross-genre settings, with news texts proving the most challenging to detect. Experiments compare traditional classifiers, BERT-based models, and LLM-based zero-shot and few-shot approaches; fine-tuned BERT models perform competitively and outperform the LLM-based approaches, yet their cross-genre robustness remains limited. This work establishes a reproducible benchmark for detecting Arabic AI-generated content (AIGC), provides empirical insights into genre-specific detection difficulty, and offers methodological guidance for future research.
Abstract
We introduce ALHD, the first large-scale, comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covers both MSA and dialectal Arabic, and contains over 400K balanced samples, with machine text generated by three leading LLMs and human text drawn from multiple sources. This design enables the study of generalizability in Arabic LLM-generated text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze, and discuss benchmark experiments on the new dataset, identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are not always consistent, however: models struggle to generalize when they encounter unseen patterns in cross-genre settings. These challenges are particularly prominent for news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research on Arabic LLM-generated text detection and on mitigating the risks of misinformation, academic dishonesty, and cyber threats.