A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

📅 2025-10-15
🤖 AI Summary
Low-resource languages suffer from severe limitations in NLP applications due to the scarcity of bilingual parallel corpora. To address this, we propose a fully automatic and scalable cross-modal parallel data augmentation method that jointly models visual and textual structures directly from multilingual newspaper images, enabling end-to-end extraction of high-quality bilingual sentence pairs. Our approach integrates OCR, news layout analysis, multilingual text recognition, and robust text alignment, and it requires no manual annotation or predefined templates. Its key innovation lies in unifying image layout understanding with semantic text alignment within a single framework, significantly improving generalization across languages and diverse newspaper layouts. We construct the first historical-newspaper-based parallel corpora for two low-resource language pairs. When used for machine translation, our automatically extracted data yields a +2.9 BLEU improvement over strong baselines, empirically validating both the quality and practical utility of the generated resources.
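
The summary above names text alignment as the final stage but does not say which algorithm is used. As a rough illustration only, the sketch below aligns two sentence lists (assumed to be already extracted from matching article regions by the OCR and layout stages) with a simple monotonic, length-ratio dynamic program. The function name `align_sentences`, the cost function, and the `skip_cost` value are hypothetical choices for this example, not the paper's method.

```python
from typing import List, Optional, Tuple


def align_sentences(
    src: List[str],
    tgt: List[str],
    skip_cost: float = 0.3,  # assumed penalty for leaving a sentence unaligned
) -> List[Tuple[str, str]]:
    """Monotonic 1-to-1 alignment by character-length ratio (illustrative only)."""

    def match_cost(a: str, b: str) -> float:
        # 0.0 for equal lengths, approaching 1.0 as the lengths diverge.
        longer = max(len(a), len(b), 1)
        return abs(len(a) - len(b)) / longer

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                c = cost[i - 1][j - 1] + match_cost(src[i - 1], tgt[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0:
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_src"
            if j > 0:
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_tgt"

    # Walk back from the end of both lists to recover the aligned pairs.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Length ratio alone is a weak signal. A practical system would likely combine it with lexical or embedding similarity, and the skip penalty would need tuning per language pair.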

📝 Abstract
Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby restricting technological benefits for the majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel corpora for two different language combinations, demonstrate the value of this dataset through the downstream task of machine translation, and improve over the current baseline by close to 3 BLEU points.
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity for low-resource languages through automated augmentation
Extracts bilingual corpora from newspaper articles using multimodal analytics
Improves machine translation performance by close to 3 BLEU points over the baseline
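
For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of BLEU: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is a simplified single-reference, sentence-level version without smoothing. Published scores are normally computed at corpus level with a standard tool, and the paper's exact evaluation setup is not stated here, so the function names and defaults below are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU on a 0-100 scale (illustrative only)."""
    if not hypothesis:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    c, r = len(hypothesis), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

On this scale, a gain of close to 3 points means the translation system's score rose by about 3 out of 100 when trained with the extracted data.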
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated parallel data extraction from newspaper articles
Combining image and text analytics for corpus creation
Scalable methodology for low-resource language translation