Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant degradation in alignment performance of multilingual large language models (LLMs) for low-resource languages, particularly Hindi, relative to English, this work proposes an LLM-based selective translation method: only natural-language segments are translated, while structured, non-translatable content (e.g., code snippets, mathematical expressions, JSON) is automatically identified and preserved intact. Combined with filtering of noisy translation outputs and a mixed fine-tuning strategy that blends English and Hindi instruction data, the approach is studied systematically, comparing translations produced by Llama-3.1-405B against a machine-translation baseline (Google Cloud Translation). Selective translation yields marked gains in Hindi alignment over vanilla full translation, and joint training with English alignment data further improves cross-lingual generalization. This work establishes a reproducible, structure-preserving recipe for efficient LLM alignment in low-resource languages.

📝 Abstract
Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Address performance gap between English and low-resource languages in LLMs
Overcome limited high-quality data for aligning LLMs to low-resource languages
Preserve non-translatable content and structure in translated alignment datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based selective translation technique
Preserves non-translatable content and structure
Mixes translated and original English data
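The preserve-and-restore idea behind selective translation can be sketched as follows. This is an illustrative assumption, not the paper's actual method: the paper prompts an LLM to identify non-translatable spans, whereas this sketch uses regex masking with placeholder tokens, and the patterns, placeholder format, and function names are hypothetical.

```python
import re

# Patterns standing in for "non-translatable" spans: fenced code blocks,
# inline code, and JSON-like objects. (The paper's LLM-based detection
# would be more general; these regexes are only a demonstration.)
NON_TRANSLATABLE = re.compile(r"```.*?```|`[^`]+`|\{[^{}]*\}", re.DOTALL)

def selective_translate(text, translate):
    """Translate only natural-language spans of `text`.

    `translate` is any callable str -> str (e.g. an MT API or LLM client).
    Structured spans are masked with placeholders before translation and
    restored verbatim afterwards.
    """
    preserved = []

    def mask(match):
        preserved.append(match.group(0))
        # Uppercase, bracketed placeholders are chosen to survive the
        # translation step unchanged; a robust scheme would verify this.
        return f"<SEG{len(preserved) - 1}>"

    masked = NON_TRANSLATABLE.sub(mask, text)
    translated = translate(masked)
    for i, segment in enumerate(preserved):
        translated = translated.replace(f"<SEG{i}>", segment)
    return translated

# Toy "translator" (uppercasing) stands in for English-to-Hindi MT.
demo = selective_translate(
    'Call `json.loads` on {"key": 1} to parse it.',
    lambda s: s.upper(),
)
print(demo)  # the code span and JSON object come back untouched
```

Running the toy example translates the prose (here, uppercases it) while `` `json.loads` `` and the JSON object pass through intact, which is exactly the property vanilla full-text translation fails to guarantee.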