🤖 AI Summary
Conventional re-alignment strategies for multilingual models exhibit unstable performance on low-resource languages (LRLs) and rely heavily on high-quality parallel corpora, a resource that is scarce for many LRLs.
Method: We propose *selective re-alignment*, a paradigm that replaces re-alignment over all available languages with re-alignment over a carefully curated subset, selected using typological diversity metrics.
Contribution/Results: Controlled experiments reveal that not all languages contribute positively to re-alignment. The selected subset matches or even surpasses the cross-lingual transfer performance of the full-language baseline, improves LRL performance by a substantial margin, and generalizes better zero-shot to unseen languages. Crucially, selective re-alignment reduces dependence on scarce parallel data, making re-alignment both more robust and more practical. This approach offers a principled, resource-efficient alternative for multilingual modeling in low-resource settings.
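The summary does not specify how the typologically diverse subset is chosen. One common way to operationalize "typological diversity" is greedy farthest-first (max-min) selection over typological feature vectors such as those in URIEL/lang2vec. The sketch below illustrates that idea only; the feature values and the selection rule are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

# Hypothetical binary typological feature vectors (in practice these could
# come from URIEL/lang2vec); both languages and values are illustrative.
LANG_FEATURES = {
    "en": np.array([1.0, 0.0, 0.0, 1.0]),
    "de": np.array([1.0, 0.0, 1.0, 1.0]),
    "hi": np.array([0.0, 1.0, 1.0, 0.0]),
    "zh": np.array([0.0, 0.0, 0.0, 0.0]),
    "sw": np.array([0.0, 1.0, 0.0, 1.0]),
}

def select_diverse_subset(features, k, seed_lang="en"):
    """Greedy max-min (farthest-first) selection of k typologically
    diverse languages, starting from a seed language."""
    selected = [seed_lang]
    remaining = [lang for lang in features if lang != seed_lang]
    while len(selected) < k and remaining:
        # Add the language whose minimum distance to the current
        # subset is largest, i.e. the most typologically novel one.
        best = max(
            remaining,
            key=lambda lang: min(
                np.linalg.norm(features[lang] - features[s]) for s in selected
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

subset = select_diverse_subset(LANG_FEATURES, k=3)
```

With these toy vectors, the greedy rule first adds the language farthest from the seed, then the one farthest from the growing subset, so the result covers the feature space with few languages, which is the intuition behind replacing full-language coverage with a small diverse subset.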
📝 Abstract
Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for low-resource languages (LRLs) and languages typologically distant from English. Moreover, realignment typically depends on word-alignment tools that require high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or whether strategically selected subsets can offer comparable or even improved cross-lingual transfer, with particular attention to the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it on unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data-collection overhead, while remaining both efficient and robust when guided by informed language selection.