LRW-Persian: Lip-reading in the Wild Dataset for Persian Language

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Persian lip-reading research has long been hindered by the scarcity of high-quality in-the-wild visual speech data. Method: We introduce LRW-Persian, the first large-scale benchmark dataset for Persian visual speech recognition, comprising 414K video clips of 743 words extracted from 67 TV programs, with precise ASR-aligned transcripts, speaker localization, head-pose estimation, and multi-dimensional quality annotations. We propose a fully automated end-to-end data-curation pipeline featuring active speaker detection, mask-based filtering, and speaker-disjoint train/test splits. Contribution/Results: Using LRW-Persian, we systematically evaluate state-of-the-art lip-reading models, establish the first Persian baselines, and identify challenges specific to low-resource languages, including phoneme-level articulatory diversity, illumination robustness, and cross-speaker generalization. The dataset is publicly released to advance assistive technologies for the hearing-impaired and multimodal speech research.

📝 Abstract
Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English resources for visual speech recognition remain limited. We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising 743 target words and over 414,000 video samples extracted from more than 1,900 hours of footage across 67 television programs. Designed as a benchmark-ready resource, LRW-Persian provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata including head pose, age, and gender. To ensure large-scale data quality, we establish a fully automated end-to-end curation pipeline encompassing transcription based on Automatic Speech Recognition (ASR), active-speaker localization, quality filtering, and pose/mask screening. We further fine-tune two widely used lipreading architectures on LRW-Persian, establishing reference performance and demonstrating the difficulty of Persian visual speech recognition. By filling a critical gap in low-resource languages, LRW-Persian enables rigorous benchmarking, supports cross-lingual transfer, and provides a foundation for advancing multimodal speech research in underrepresented linguistic contexts. The dataset is publicly available at: https://lrw-persian.vercel.app.
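The abstract highlights speaker-disjoint train/test splits, meaning no speaker appears in both partitions. A minimal sketch of how such a split can be produced is shown below; the `(clip_id, speaker_id)` schema, function name, and split fraction are assumptions for illustration, not the paper's actual implementation.

```python
import random

def speaker_disjoint_split(clips, test_fraction=0.2, seed=0):
    """Partition clips into train/test so that no speaker appears in both.

    `clips` is a hypothetical list of (clip_id, speaker_id) pairs; the
    paper does not specify its data schema.
    """
    # Collect the unique speakers and shuffle them deterministically.
    speakers = sorted({spk for _, spk in clips})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out a fraction of speakers (not clips) for the test set.
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    # Assign every clip according to its speaker's partition.
    train = [c for c in clips if c[1] not in test_speakers]
    test = [c for c in clips if c[1] in test_speakers]
    return train, test
```

Splitting by speaker rather than by clip is what makes the benchmark measure cross-speaker generalization: a model cannot score well by memorizing the lip appearance of speakers it has already seen.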
Problem

Research questions and friction points this paper is trying to address.

No large-scale in-the-wild Persian word-level lipreading dataset exists
Non-English resources for visual speech recognition remain limited
No established benchmark for Persian visual speech recognition systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for large-scale Persian lipreading dataset
End-to-end curation with ASR transcription and quality filtering
Fine-tuned architectures establishing Persian visual speech benchmarks
Zahra Taghizadeh
Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran
Mohammad Shahverdikondori
PhD student, EPFL
Online Learning, Causality, Computer Vision
Arian Noori
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Alireza Dadgarnia
Department of Mathematical Sciences, Sharif University of Technology, Tehran, Iran