Preserving Multilingual Quality While Tuning Query Encoder on English Only

📅 2024-07-01

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This study examines whether the multilingual representation quality of a dual-encoder retrieval system degrades when only the query encoder is fine-tuned on English data. It proposes "adiabatic fine-tuning": supervised fine-tuning at an intentionally low learning rate, intended to preserve, and in some cases improve, the model's existing cross-lingual representational capacity. The method starts from a high-quality multilingual embedding model, pairs low-learning-rate optimization with cross-lingual quality evaluation, and requires no translation or additional multilingual supervision. According to the reported experiments, after English-only fine-tuning the approach maintains the original non-English retrieval performance while achieving an average +1.2% MRR gain on multilingual benchmarks (e.g., MIRACL, XQuAD) and on heterogeneous-domain data, which is read as evidence of implicit knowledge transfer. The core contribution is the adaptation of the adiabatic principle to multilingual representation fine-tuning, offering a scalable, translation-free, and lightweight optimization strategy for cross-lingual retrieval under resource constraints.
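Because the summary reports its gains in MRR, a minimal sketch of how mean reciprocal rank is commonly computed may help. The function name, document ids, and rankings below are hypothetical and are not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the first relevant document per query.

    ranked_lists[i]: document ids ordered by score for query i.
    relevant[i]: set of ids considered relevant for query i.
    A query with no relevant document in its ranking contributes 0.
    """
    total = 0.0
    for ranking, gold in zip(ranked_lists, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical toy data: three queries, three ranked documents each.
rankings = [["d3", "d1", "d2"], ["d2", "d3", "d1"], ["d1", "d2", "d3"]]
gold = [{"d1"}, {"d2"}, {"d9"}]
mrr = mean_reciprocal_rank(rankings, gold)  # (1/2 + 1 + 0) / 3
```

Comparing this value before and after tuning, per language, is one simple way to check whether non-English retrieval quality has been preserved.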

📝 Abstract

A query encoder of a dual passage retrieval system can be tuned for specific types of queries or domains, while the precomputed and stored document representations are kept intact. Switching from one query encoder to another when needed is easily feasible, unlike overhauling the embeddings of a whole knowledge base. In this work, we raise a question: can the generic, original qualities of the encoder be preserved, or at least not degraded too much, when it is tuned on a narrow domain? We conducted experiments on a high-quality multilingual embedding model: tuning it on a single English-only dataset, we observe that the tuning not only preserves the multilingual qualities, but even improves them. The embedding qualities on distinctly different data are also improved or at least preserved. Drawing on our observations, we suggest a more general hypothesis: tuning with an intentionally low learning rate can preserve or improve a system's properties that were acquired in training but are not specifically targeted by tuning. We call this adiabatic tuning and provide tentative explanations.
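The setup described in the abstract can be illustrated with a toy sketch in which only the query side is updated while the stored document embeddings stay frozen. Everything here is an assumption made for illustration: the linear map `W` stands in for a pre-trained query encoder, the data are random, and the loss is a generic softmax contrastive loss rather than the paper's training objective. The sketch only shows that a lower learning rate moves the encoder less from its starting point; it says nothing about multilingual quality.

```python
import math
import random


def encode(W, x):
    """Apply the toy linear query encoder W to input features x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tune_query_encoder(W, pairs, doc_embs, lr, epochs=5):
    """SGD on a softmax contrastive loss. Only W changes; doc_embs are read-only."""
    W = [row[:] for row in W]  # work on a copy, keep the original encoder
    for _ in range(epochs):
        for x, pos in pairs:
            q = encode(W, x)
            probs = softmax([sum(a * b for a, b in zip(q, d)) for d in doc_embs])
            # Gradient of -log p_pos w.r.t. q: sum_j p_j * d_j - d_pos
            grad_q = [
                sum(p * d[k] for p, d in zip(probs, doc_embs)) - doc_embs[pos][k]
                for k in range(len(q))
            ]
            # Gradient w.r.t. W[k][i] is grad_q[k] * x[i]
            for k, row in enumerate(W):
                for i in range(len(row)):
                    row[i] -= lr * grad_q[k] * x[i]
    return W


def drift(W_new, W_old):
    """Frobenius distance between two encoders (how far tuning moved the weights)."""
    return math.sqrt(
        sum((a - b) ** 2 for ra, rb in zip(W_new, W_old) for a, b in zip(ra, rb))
    )


random.seed(0)
dim = 4
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
doc_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
# Hypothetical queries: noisy copies of their positive document.
pairs = [([v + random.gauss(0, 0.1) for v in doc_embs[i]], i) for i in range(6)]
docs_before = [d[:] for d in doc_embs]

W_low = tune_query_encoder(identity, pairs, doc_embs, lr=1e-4)
W_high = tune_query_encoder(identity, pairs, doc_embs, lr=1e-1)
print("drift at low lr :", drift(W_low, identity))
print("drift at high lr:", drift(W_high, identity))
```

Since `doc_embs` is never written to, a tuned query encoder can be swapped in or out without recomputing the stored index, which is the practical appeal of tuning the query side only.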

Problem

Research questions and friction points this paper is trying to address.

Preserve multilingual quality when tuning query encoder

Evaluate impact of English-only tuning on multilingual embeddings

Explore low learning rate tuning for property preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tuning query encoder on English only

Preserving multilingual qualities via tuning

Adiabatic tuning with low learning rate
