Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer

📅 2025-10-03
🤖 AI Summary
To address the lack of effective source-language ranking for zero-shot cross-lingual transfer, this paper proposes NN-Rank, an algorithm that ranks candidate source languages by how well their representations align with the target language. Using hidden-layer representations from multilingual pretrained models (mBERT and XLM-R), NN-Rank computes nearest-neighbor similarity and requires only unlabeled target-language data. Unlike conventional approaches that rely on static lexical and linguistic features, it enables fully automatic, data-driven source-language selection. Empirically, NN-Rank improves Normalized Discounted Cumulative Gain (NDCG) by up to 35.56 on part-of-speech tagging across 56 target languages and 18.14 on named entity recognition across 72 languages. Even without in-domain target-language data, it achieves 92.8% of the NDCG obtained with the full target-language corpus, using as few as 25 examples and a general-purpose out-of-domain corpus (the Bible).

๐Ÿ“ Abstract
We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.
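The abstract reports ranking quality in NDCG. As a reminder of what that metric measures, here is a minimal sketch of the standard NDCG computation over a ranked list of graded relevance scores; the paper's exact relevance grading for source languages is not specified here, so this is an illustration of the metric only.

```python
import math

def ndcg(ranked_relevances, k=None):
    """Standard NDCG: discounted cumulative gain of the produced ranking,
    normalized by the DCG of the ideal (relevance-sorted) ranking."""
    rels = ranked_relevances[:k] if k else ranked_relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(ranked_relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that already orders languages by true transfer quality scores 1.0; any misordering is discounted more heavily near the top of the list.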
Problem

Research questions and friction points this paper is trying to address.

Ranking source languages for zero-shot cross-lingual transfer learning
Using multilingual model representations and unlabeled target data
Improving transfer for POS tagging (56 target languages) and NER (72 target languages) from 51 candidate source languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages hidden representations from multilingual models
Uses unlabeled target-language data for ranking
Works with minimal data and out-of-domain corpora
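The idea behind the bullets above can be sketched concretely: embed a small set of unlabeled target-language examples, compare each against pooled embeddings from every candidate source language, and rank languages by mean nearest-neighbor cosine similarity. This is a hypothetical reconstruction assuming precomputed embeddings (e.g., from mBERT or XLM-R); the paper's exact scoring function may differ.

```python
import numpy as np

def nn_rank(target_embs, source_embs_by_lang, k=1):
    """Rank source languages by how similar the target examples are to
    their k nearest neighbors in each source language's embedding pool.
    target_embs: (n_target, dim) array of target-language embeddings.
    source_embs_by_lang: dict mapping language name -> (n_i, dim) array.
    Returns language names sorted from best to worst transfer candidate."""
    def normalize(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    t = normalize(target_embs)
    scores = {}
    for lang, embs in source_embs_by_lang.items():
        s = normalize(embs)
        sims = t @ s.T                        # pairwise cosine similarities
        topk = np.sort(sims, axis=1)[:, -k:]  # k most similar source examples per target example
        scores[lang] = float(topk.mean())     # language score = mean top-k similarity
    return sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on unlabeled embeddings, the same routine works whether the target examples come from in-domain data or an out-of-domain corpus such as the Bible.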