🤖 AI Summary
This work addresses core challenges in dense retrieval for Amharic—a low-resource language with 120 million speakers—including scarcity of labeled data, pretraining resources, and word embeddings. We present the first systematic feasibility study, proposing a lightweight fine-tuning and cross-lingual transfer framework built upon mBERT and XLM-R. Our approach integrates contrastive learning, pseudo-labeling, and unsupervised domain adaptation to train dense encoders. Evaluated on a newly constructed Amharic QA retrieval benchmark—the first of its kind—we achieve a 37% improvement in Recall@10 over baseline methods, substantially outperforming traditional sparse retrieval and zero-shot cross-lingual baselines. Key contributions are: (1) the first publicly available Amharic dense retrieval benchmark; (2) empirical validation of lightweight adaptation and cross-lingual transfer efficacy in low-resource settings; and (3) a reproducible methodology for information retrieval in African languages.
📝 Abstract
This paper reports some difficulties and some results when using dense retrievers on Amharic, one of the low-resource languages spoken by 120 millions populations. The efforts put and difficulties faced by University Addis Ababa toward Amharic Information Retrieval will be developed during the presentation.