🤖 AI Summary
Persian NLP research has long been hindered by small-scale, low-diversity, and inconsistently curated text corpora, limiting the development of both conventional NLP models and open-source large language models (LLMs). To address this, we introduce the largest high-quality open Persian corpus to date—72.9 billion tokens—spanning diverse domains and genres. Our methodology features rigorous deduplication (via MinHash + LSH), fine-grained quality filtering based on linguistic and statistical heuristics, and standardized preprocessing. We fully open-source all preprocessing code and comprehensive documentation. Transformer models trained on this corpus achieve significant improvements over strong baselines across summarization, named entity recognition (NER), and language modeling. The corpus and tooling are publicly released under permissive licenses and have already enabled multiple Persian LLM initiatives. This work fills a critical gap in the NLP ecosystem: a large-scale, high-fidelity, reproducible foundational resource for Persian language modeling.
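The MinHash + LSH deduplication step mentioned above can be illustrated with a small from-scratch sketch. This is not the authors' released pipeline; the shingle size, 64 permutations, and 16-band/4-row LSH split are illustrative assumptions, and a production run would use a tuned library implementation (e.g. `datasketch`) over billions of documents.

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_perm=64):
    """Approximate min-wise permutations via salted hashing:
    one minimum hash value per 'permutation' salt."""
    sig = []
    for p in range(num_perm):
        salt = str(p).encode()
        sig.append(min(
            int(hashlib.md5(salt + s.encode("utf-8")).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def lsh_buckets(signatures, bands=16, rows=4):
    """Split each signature into bands; documents whose band
    hashes collide become candidate duplicate groups."""
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault((b, band), []).append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Two near-duplicate texts share most shingles, so their signatures agree on nearly all positions and collide in at least one band, flagging them as a candidate pair for removal; unrelated texts almost never collide.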
📝 Abstract
Text corpora are essential for training models used in tasks like summarization, translation, and large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and the preprocessing code are publicly available, enabling researchers to build on and improve this resource for future Persian NLP advancements.