GRDD: A Dataset for Greek Dialectal NLP

📅 2023-08-01

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

Computational research on Modern Greek dialects is hindered by the scarcity of large-scale, diverse, annotated corpora—particularly for four key varieties: Cretan, Pontic, Northern Greek, and Cypriot. Method: We introduce the first large-scale, high-quality textual dataset covering all four dialects and conduct dialect identification using linguistically informed features combined with both traditional machine learning (SVM, XGBoost) and lightweight deep models (LSTM, CNN). Contribution/Results: Our experiments demonstrate that even simple models achieve high classification accuracy, confirming substantial linguistic discriminability among these dialects. Error analysis identifies data cleaning quality—not model complexity—as the primary performance bottleneck. This work establishes the first publicly available, reproducible benchmark for multi-dialect Modern Greek computation, addressing a critical resource gap and offering methodological insights for low-resource dialect processing.

📝 Abstract

In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek, Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and presents the first attempt to create large scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect idefntification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have distinct enough characteristics allowing even simple ML models to perform well on the task. Error analysis is performed for the top performing algorithms showing that in a number of cases the errors are due to insufficient dataset cleaning.

Problem

Research questions and friction points this paper is trying to address.

Creating a dataset for computational study of Modern Greek dialects

Performing dialect identification using machine learning approaches

Analyzing model performance and error patterns in dialect classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created dataset for Modern Greek dialects

Applied traditional ML and DL algorithms

Performed error analysis on top models

🔎 Similar Papers

Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP