Matina: A Large-Scale 73B Token Persian Text Corpus

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Persian NLP research has long been hindered by small-scale, low-diversity, and inconsistently curated text corpora, limiting the development of both conventional NLP models and open-source large language models (LLMs). To address this, we introduce Matina, the largest high-quality open Persian corpus to date (72.9 billion tokens), spanning diverse domains and genres. Our methodology features rigorous deduplication (via MinHash + LSH), fine-grained quality filtering based on linguistic and statistical heuristics, and standardized preprocessing. We fully open-source all preprocessing code along with comprehensive documentation. Transformer models trained on this corpus achieve significant improvements over strong baselines across summarization, named entity recognition (NER), and language modeling. The corpus and tooling are publicly released under permissive licenses and have already enabled multiple Persian LLM initiatives. This work fills a critical gap in the NLP ecosystem: a large-scale, high-fidelity, reproducible foundational resource for Persian language modeling.
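The deduplication step named in the summary, MinHash signatures combined with locality-sensitive hashing (LSH) banding, can be sketched in pure Python. This is an illustrative toy, not the paper's released pipeline: the character-shingle size, number of permutations, and band layout below are assumptions chosen for readability, not the authors' settings.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64   # hash "permutations" per MinHash signature (assumed value)
BANDS = 16      # LSH bands; rows per band = NUM_PERM // BANDS

def shingles(text, n=3):
    # Character n-grams serve as the document's feature set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text):
    # Simulate NUM_PERM permutations by salting a single hash function
    # with a per-permutation seed and keeping the minimum value.
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def lsh_buckets(docs):
    # Split each signature into bands; documents whose band of rows
    # hashes identically land in the same bucket and become
    # near-duplicate candidates.
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

With 16 bands of 4 rows each, pairs with Jaccard similarity above roughly 0.5 collide in at least one band with high probability, so near-duplicate pages are flagged without comparing every document pair. A production pipeline would additionally normalize text before shingling and pick thresholds tuned to the corpus.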

📝 Abstract
Text corpora are essential for training models used in tasks like summarization, translation, and large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and preprocessing codes are publicly available, enabling researchers to build on and improve this resource for future Persian NLP advancements.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale Persian text corpus
Limited diversity in existing Persian datasets
Slowed development of Persian NLP models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Persian text corpus
Preprocessing and deduplication techniques
Transformer-based model training
👥 Authors

Sara Bourbour Hosseinbeigi
Tarbiat Modares University

Fatemeh Taherinezhad
University of Tehran

Heshaam Faili
Full Professor, University of Tehran
Natural Language Processing, Social Network

Hamed Baghbani
PhD Candidate at University of Tehran
NLP

Fatemeh Nadi
University of Tehran

Mostafa Amiri
University of Tehran