Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition

📅 2025-06-14

🤖 AI Summary

Arabic named entity recognition (NER) models generalize poorly across domains and dialects, and comprehensive multi-dialect, multi-domain corpora for evaluating them have been lacking. To address this, we introduce Konooz, a large-scale, manually annotated corpus covering 16 Arabic dialects and 10 domains, comprising about 777K tokens annotated with 21 entity types under both nested and flat schemes. Because every dialect is paired with every domain, the corpus can be partitioned along either axis, yielding 160 dialect-domain subcorpora. We also use the Maximum Mean Discrepancy (MMD) metric to quantify the overlap between domains and dialects, which supports a fine-grained analysis of distributional shift. Benchmarking four Arabic NER models shows a performance drop of up to 38% relative to in-distribution data under cross-dialect and cross-domain evaluation, and we analyze the impact of resource scarcity. All 160 subcorpora are publicly released. The work provides a systematic, large-scale evaluation benchmark for Arabic NER across dialects and domains, together with a divergence analysis that helps explain where and why models fail to generalize.

📝 Abstract

We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks such as domain adaptation and transfer learning, this paper focuses primarily on benchmarking existing Arabic Named Entity Recognition (NER) models, especially their cross-domain and cross-dialect performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% compared to in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measure the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrate why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download
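To make the MMD measurement mentioned above more concrete, the following is a minimal sketch of a squared-MMD estimate between two samples of feature vectors. The abstract does not state which kernel, features, or estimator the authors used, so everything here is an assumption for illustration: an RBF kernel, the biased estimator, and tiny hand-made 2-D vectors standing in for sentence or token representations of two subcorpora. The function names and the `gamma` parameter are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float) -> float:
    """RBF kernel exp(-gamma * ||x - y||^2). The kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


# Toy vectors (hypothetical, not Konooz data): one reference sample,
# one slightly shifted sample, and one far-away sample.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
near = [(0.1, 0.0), (1.1, 0.0), (0.1, 1.0)]
far = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]

mmd_same = mmd_squared(reference, reference)
mmd_near = mmd_squared(reference, near)
mmd_far = mmd_squared(reference, far)
```

Under these assumptions, a sample compared with itself gives a value of zero, and the far-away sample gives a larger value than the slightly shifted one. Read this way, a larger MMD between two dialect-domain subcorpora would indicate less overlap between them.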

Problem

Research questions and friction points this paper is trying to address.

Benchmarking Arabic NER models across 16 dialects and 10 domains

Analyzing performance drop in cross-domain and cross-dialect NER tasks

Investigating domain-dialect divergence and resource scarcity impact on NER

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-domain multi-dialect Arabic NER corpus

Manual annotation with nested and flat schemes

Performance analysis using MMD metric
