Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

📅 2025-10-04
🤖 AI Summary
This study addresses key bottlenecks in Hadith digitization (high manual annotation costs, weak multilingual support, and insufficient semantic depth) by proposing a fully automated processing pipeline powered by large language models (LLMs). It introduces Rezwan, an AI-assisted Hadith corpus of over 1.2 million narrations, built through end-to-end processing: text segmentation, separation of the narrator chain from the narration text, multi-layer semantic annotation (thematic tags, abstractive summaries, diacritization), cross-text semantic analysis, and machine translation into twelve languages, all with automated validation. In an evaluation of 1,213 randomly sampled narrations assessed by six domain experts, chain-text separation and summarization each scored 9.33/10, which the authors describe as near-human accuracy; diacritization and semantic similarity detection remain open challenges. The mean overall quality score was 8.46/10, against 3.66/10 for the manually curated Noor Corpus. The authors report that tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The result is a scalable, multilingual, semantically annotated resource for digital humanities and Islamic studies.

📝 Abstract
This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain-text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain-text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Najm in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.
Problem

Research questions and friction points this paper is trying to address.

Automating Hadith text segmentation and chain-text separation
Enhancing narrations with multilingual translation and semantic analysis
Creating scalable AI infrastructure for Islamic digital humanities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline extracts Hadith narrations using LLMs
Multi-layer enrichment includes translation and semantic analysis
AI approach achieves near-human accuracy at low cost
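
The staged flow described in the abstract (chain-text separation, validation, then enrichment) can be pictured with a minimal Python sketch. This is not the authors' implementation: the `Narration` record, the `LLM` callable interface, the `|||` delimiter, the substring-based validation rule, and the `fake_llm` stand-in are all assumptions made for illustration, and diacritization and cross-text semantic analysis are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interface: (task, text) -> model reply. Not a real library API.
LLM = Callable[[str, str], str]


@dataclass
class Narration:
    """Illustrative record for one narration and its enrichment layers."""
    raw_text: str
    chain: str = ""                      # narrator chain
    body: str = ""                       # narration text
    summary: str = ""
    topics: List[str] = field(default_factory=list)
    translations: Dict[str, str] = field(default_factory=dict)
    valid: bool = False


def separate_chain(n: Narration, llm: LLM) -> Narration:
    # Assumed convention: the model returns "chain|||text".
    reply = llm("separate", n.raw_text)
    chain, sep, body = reply.partition("|||")
    if sep:
        n.chain, n.body = chain.strip(), body.strip()
    else:
        n.body = reply.strip()
    return n


def validate(n: Narration) -> Narration:
    # Assumed check: both parts must appear verbatim in the source text,
    # so a reply that invents or rewrites text is rejected.
    n.valid = bool(n.body) and n.body in n.raw_text and n.chain in n.raw_text
    return n


def enrich(n: Narration, llm: LLM, languages: List[str]) -> Narration:
    if not n.valid:
        return n  # skip enrichment for narrations that failed validation
    n.summary = llm("summarize", n.body)
    n.topics = [t.strip() for t in llm("topics", n.body).split(",") if t.strip()]
    for lang in languages:
        n.translations[lang] = llm("translate:" + lang, n.body)
    return n


def process(raw_text: str, llm: LLM, languages: List[str]) -> Narration:
    n = Narration(raw_text=raw_text)
    return enrich(validate(separate_chain(n, llm)), llm, languages)


def fake_llm(task: str, text: str) -> str:
    """Deterministic stand-in for a model, used only to exercise the sketch."""
    if task == "separate":
        head, sep, tail = text.partition(": ")
        return head + "|||" + tail if sep else text
    if task == "summarize":
        return text[:20]
    if task == "topics":
        return "knowledge, ethics"
    return "[" + task.split(":", 1)[1] + "] " + text


if __name__ == "__main__":
    result = process("A from B: Seek knowledge.", fake_llm, ["en", "fa"])
    print(result)
```

Keeping validation as a separate stage between separation and enrichment means a failed split costs one model call rather than one per enrichment layer; whether the paper's pipeline orders its stages this way is not stated in the abstract.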
Majid Asgari-Bidhendi
Iran University of Science and Technology
Muhammad Amin Ghaseminia
Iran University of Science and Technology
Alireza Shahbazi
Noor Avaran Jelvehaye Maanaei Najm Co.
Sayyed Ali Hossayni
Noor Avaran Jelvehaye Maanaei Najm Co.
Najmeh Torabian
Islamic Azad University
Behrouz Minaei-Bidgoli
Professor, School of Computer Engineering, Iran University of Science and Technology
Data Mining, Natural Language Processing, Machine Learning