🤖 AI Summary
NLP applications for low-resource languages are severely limited by the scarcity of bilingual parallel corpora. To address this, we propose a fully automatic and scalable cross-modal data augmentation method that jointly models visual and textual structure in multilingual newspaper images, enabling end-to-end extraction of high-quality bilingual sentence pairs. Our approach integrates OCR, news layout analysis, multilingual text recognition, and robust text alignment, and requires no manual annotation or predefined templates. Its key innovation is to unify image layout understanding and semantic text alignment within a single framework, which improves generalization across languages and diverse newspaper layouts. We construct the first historical-newspaper-based parallel corpora for two low-resource language pairs. When used to train machine translation models, our automatically extracted data yields a +2.9 BLEU improvement over strong baselines, empirically validating both the quality and the practical utility of the generated resources.
📝 Abstract
Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby restricting the benefits of language technology for the majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology for extracting bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel corpora for two different language combinations, and we demonstrate the value of this dataset on the downstream task of machine translation, where it improves over the current baseline by close to 3 BLEU points.