🤖 AI Summary
This study investigates the role of linguistic similarity in cross-lingual transfer, addressing the challenge of selecting source languages under resource constraints. We systematically evaluate zero-shot and fine-tuned transfer performance of mBERT and XLM-R across 266 typologically diverse languages on part-of-speech tagging, dependency parsing, and topic classification. Using several language distance metrics (genealogical, phonological, morphological, and treebank-based), we conduct attribution analysis to quantify how well similarity predicts transfer performance. Our key finding is that the effect of linguistic similarity is conditional: it depends on the task type (syntactic vs. semantic), the model's input representation, and the specific definition of similarity. These results challenge the oversimplified "similarity implies benefit" assumption, showing that similarity is neither universally predictive nor invariant across tasks or models. The work thus offers guidance for task-aware cross-lingual data selection in low-resource settings.
📝 Abstract
Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include is unclear. Prior research often focuses on a small set of languages from a few language families and/or a single task. It is still an open question how these findings extend to a wider variety of languages and tasks. In this work, we analyze cross-lingual transfer for 266 languages from a wide variety of language families. Moreover, we include three popular NLP tasks: POS tagging, dependency parsing, and topic classification. Our findings indicate that the effect of linguistic similarity on transfer performance depends on a range of factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity.