🤖 AI Summary
This work addresses the challenge of disentangling the mechanisms behind cross-lingual generalization in language models, which natural corpora confound through entangled factors such as lexical overlap, morphology, data imbalance, and tokenization. To isolate these variables, the authors propose an in vitro experimental framework that procedurally generates two synthetic languages sharing the same ontology, typed grammar, and compositional structure but differing in surface realization. Transfer is evaluated on a masked minority-language condition whose lexical forms are never observed during training. By independently varying lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size across 700 controlled runs, they find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether the tokenizer preserves reusable cross-lingual subword structure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn word forms into language-specific atoms. Transfer also emerges in stages: grammatical and type-level competence precede masked lexical generalization. Finally, the strength of tokenizer bridges correlates strongly with masked reachability.
📝 Abstract
Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.
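The vocabulary-size effect can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in and not the paper's setup: the two word forms, the hand-written vocabularies, the greedy longest-match `tokenize` function, and the Jaccard-style `shared_fraction` score, which is only a rough proxy and not the paper's bridge-strength measure.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation (toy stand-in for a subword tokenizer).

    Single characters are always allowed as a fallback.
    """
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def shared_fraction(word_a, word_b, vocab):
    """Jaccard overlap of the two words' token sets (hypothetical proxy)."""
    a = set(tokenize(word_a, vocab))
    b = set(tokenize(word_b, vocab))
    return len(a & b) / len(a | b)


# Hypothetical surface forms for one concept in the two synthetic languages.
word_a, word_b = "katoma", "katomi"

# Small vocabulary: only short fragments, so both words stay decomposable.
small_vocab = {"ka", "to"}
# Large vocabulary: whole words are stored as language-specific atoms.
large_vocab = small_vocab | {"katoma", "katomi"}

print(tokenize(word_a, small_vocab), tokenize(word_b, small_vocab))
print(shared_fraction(word_a, word_b, small_vocab))
print(tokenize(word_a, large_vocab), tokenize(word_b, large_vocab))
print(shared_fraction(word_a, word_b, large_vocab))
```

With the small vocabulary, both forms split into fragments that mostly coincide. With the large one, each form becomes a single token with nothing in common with the other, even though the strings differ by one character.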