AI Summary
Arabic text often has ambiguous, inconsistent, or missing punctuation, which makes sentence boundary detection difficult. Existing methods rely heavily on punctuation cues and are typically evaluated on well-formed text, so they lack robustness in realistic settings. To address this, the work introduces AraSEG, an Arabic sentence segmentation corpus spanning eight genres and a wide range of punctuation and document structure conditions. Using AraSEG, the study evaluates large language models (LLMs), lightweight encoder models, and dependency parser-based models under increasingly challenging segmentation settings. Lightweight encoders, and even dependency parser-based models, outperform LLMs in the most challenging settings, and accurate sentence segmentation substantially improves downstream dependency parsing. The study also finds that performance eventually saturates as training data size and genre diversity grow, and that cross-genre generalization remains challenging.
Abstract
Sentence segmentation in Arabic is challenging due to ambiguous and inconsistent punctuation, with many texts lacking reliable sentence boundary markers. Existing approaches rely heavily on punctuation cues and are typically evaluated on well-formed text, limiting their robustness in realistic Arabic settings. To address this, we introduce AraSEG, a genre-diverse sentence segmentation corpus spanning eight genres and a wide range of punctuation and document structure conditions. Using AraSEG, we evaluate LLMs, lightweight encoder models, and dependency parser-based models under increasingly challenging segmentation settings. Our experiments show that lightweight encoders, and even dependency parser-based models, outperform LLMs in the most challenging settings. We further investigate the effects of training data size and genre diversity, finding that performance eventually saturates and cross-genre generalization remains challenging. We also demonstrate that accurate sentence segmentation substantially improves downstream dependency parsing. We make our code, data, and models publicly available.