GRDD: A Dataset for Greek Dialectal NLP

📅 2023-08-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Computational research on Modern Greek dialects is hindered by the scarcity of large-scale, diverse corpora, particularly for four varieties: Cretan, Pontic, Northern Greek, and Cypriot Greek. Method: The authors introduce a large-scale raw-text dataset covering all four dialects and perform dialect identification using linguistically informed features combined with both traditional machine learning (SVM, XGBoost) and lightweight deep models (LSTM, CNN). Contribution/Results: The experiments show that even simple models achieve high classification accuracy, which suggests that these dialects are linguistically distinct enough to be separated automatically. Error analysis of the top-performing models shows that a number of the errors stem from insufficient dataset cleaning. The work is a first attempt at large-scale dialectal resources of this type for Modern Greek, addressing a resource gap and offering methodological insights for low-resource dialect processing.
📝 Abstract
In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have characteristics distinct enough to allow even simple ML models to perform well. Error analysis is performed for the top-performing algorithms, showing that in a number of cases the errors are due to insufficient dataset cleaning.
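To make the dialect identification setup concrete, the following is a minimal sketch of one kind of simple traditional model: a multinomial Naive Bayes classifier over character trigrams, written with the Python standard library only. It is not the authors' pipeline; the abstract does not specify features or hyperparameters. The class name `NgramNaiveBayes`, the labels `variety_a` and `variety_b`, and the training strings are all hypothetical placeholders, not samples from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical illustration only; not the model used in the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder strings (NOT real dialect data).
train_texts = ["zzta zzto zzti", "zzte zzta", "qqra qqro", "qqri qqre qqra"]
train_labels = ["variety_a", "variety_a", "variety_b", "variety_b"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
prediction_a = model.predict("zzto zzta")
prediction_b = model.predict("qqro")
```

Character n-grams are a common choice for closely related varieties because they capture spelling and morphological cues without requiring a tokenizer or tagger for each dialect.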
Problem

Research questions and friction points this paper is trying to address.

Creating a dataset for computational study of Modern Greek dialects
Performing dialect identification using machine learning approaches
Analyzing model performance and error patterns in dialect classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created dataset for Modern Greek dialects
Applied traditional ML and DL algorithms
Performed error analysis on top models
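Because the dataset is described as imbalanced, an error analysis usually looks at per-dialect scores and at which dialect pairs are confused, rather than at overall accuracy alone. The sketch below shows one way to compute a confusion count table and macro-averaged F1 in plain Python. The function names and the gold and predicted label lists are invented for illustration; they are not results from the paper.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


def per_class_f1(gold, predicted):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1, so small classes count equally."""
    scores = per_class_f1(gold, predicted)
    return sum(scores.values()) / len(scores)


# Invented labels for illustration only (NOT results from the paper).
gold = ["cypriot"] * 4 + ["pontic", "pontic", "cretan", "northern"]
pred = ["cypriot"] * 4 + ["pontic", "cypriot", "cretan", "cypriot"]

confusion = confusion_counts(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
f1_by_dialect = per_class_f1(gold, pred)
macro = macro_f1(gold, pred)
```

In this toy example the accuracy is 0.75 while the macro F1 is about 0.62, because the majority class absorbs the errors on the smaller classes. Inspecting the off-diagonal entries of `confusion` is also a practical way to locate misfiled or noisy texts, which is the kind of cleaning problem the abstract mentions.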
👥 Authors
S. Chatzikyriakidis
University of Crete
Chatrine Qwaider
Researcher
Natural language processing, Computational linguistics, Artificial Intelligence, Data mining
Ilias Kolokousis
University of Leipzig
Christina Koula
University of Crete
Dimitris Papadakis
University of Crete
E. Sakellariou
University of Crete