Dense Retrieval for Low Resource Languages -- the Case of Amharic Language

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses core challenges in dense retrieval for Amharic—a low-resource language with 120 million speakers—including scarcity of labeled data, pretraining resources, and word embeddings. We present the first systematic feasibility study, proposing a lightweight fine-tuning and cross-lingual transfer framework built upon mBERT and XLM-R. Our approach integrates contrastive learning, pseudo-labeling, and unsupervised domain adaptation to train dense encoders. Evaluated on a newly constructed Amharic QA retrieval benchmark—the first of its kind—we achieve a 37% improvement in Recall@10 over baseline methods, substantially outperforming traditional sparse retrieval and zero-shot cross-lingual baselines. Key contributions are: (1) the first publicly available Amharic dense retrieval benchmark; (2) empirical validation of lightweight adaptation and cross-lingual transfer efficacy in low-resource settings; and (3) a reproducible methodology for information retrieval in African languages.

Technology Category

Application Category

📝 Abstract
This paper reports some difficulties and some results when using dense retrievers on Amharic, one of the low-resource languages spoken by 120 millions populations. The efforts put and difficulties faced by University Addis Ababa toward Amharic Information Retrieval will be developed during the presentation.
Problem

Research questions and friction points this paper is trying to address.

Dense retrieval challenges in Amharic, a low-resource language
Addressing information retrieval difficulties for 120M Amharic speakers
University efforts to improve Amharic IR systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense retrieval for Amharic language
Addressing low-resource language challenges
Collaboration with University Addis Ababa
🔎 Similar Papers
No similar papers found.