Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

📅 2025-12-09
🤖 AI Summary
To address the scarcity of native-language instruction data and high-quality human preference annotations for low-resource languages, this paper proposes an online policy-optimization method that achieves fluent preference alignment without requiring any target-language human annotations. The approach leverages a non-fluent yet trainable reward model and performs post-training via policy-based reinforcement learning, relying solely on machine-translated data and a multilingual foundation model. Its key innovations are (i) the first demonstration of simultaneous fluency preservation and preference alignment in the absence of target-language instruction-tuning data, and (ii) the elimination of dependence on native-speaker annotations or high-fidelity synthetic data. Native-speaker evaluations on Norwegian Bokmål show that the method significantly outperforms both supervised fine-tuning on machine-translated data and multilingual fine-tuning baselines in fluency and preference alignment.
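The core recipe the summary describes — sample on-policy, score with an imperfect reward model, and keep the policy close to the fluent base model — can be illustrated with a toy KL-regularized REINFORCE loop. This is a minimal sketch under stated assumptions: the policy is a categorical distribution over a tiny vocabulary, the random reward vector stands in for the disfluent reward model, and `beta` is a hypothetical KL-penalty weight; none of this is the paper's actual implementation.

```python
import torch

torch.manual_seed(0)
vocab = 8                                        # toy output space
logits = torch.zeros(vocab, requires_grad=True)  # trainable "policy"
ref_logits = torch.zeros(vocab)                  # frozen fluent base model
reward = torch.randn(vocab)                      # stand-in disfluent reward model
beta = 0.1                                       # fluency-preservation (KL) weight
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((64,))                 # sample on-policy
    logp = dist.log_prob(actions)
    with torch.no_grad():
        ref_logp = torch.distributions.Categorical(logits=ref_logits).log_prob(actions)
        # Shaped reward: alignment signal minus drift from the base model.
        shaped = reward[actions] - beta * (logp - ref_logp)
    loss = -(shaped * logp).mean()               # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
```

The on-policy property is what matters here: the policy is only ever reinforced on its own generations, so the reward model supplies the preference signal while the KL term discourages drifting away from the fluent base model.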

📝 Abstract
We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. In this work, we therefore focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised fine-tuning on machine-translated data and multilingual fine-tuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial: our method outperforms both alternatives without relying on any hard-to-obtain data.
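For contrast, the machine-translation SFT baseline from the abstract is plain next-token cross-entropy on translated instruction data, which imitates the MT text directly, translationese included. Below is a minimal runnable sketch; the `TinyLM` class, vocabulary size, and random token ids are hypothetical stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy causal LM: embedding + linear head, just enough to show the loss shape."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mt_batch = torch.randint(0, 100, (8, 16))   # stand-in machine-translated token ids

logits = model(mt_batch[:, :-1])            # predict each next token
loss = F.cross_entropy(
    logits.reshape(-1, 100),                # (batch * seq, vocab)
    mt_batch[:, 1:].reshape(-1),            # shifted MT targets
)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the targets are MT output, any disfluency in the translations becomes the training signal itself — the failure mode the on-policy method avoids.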
Problem

Research questions and friction points this paper is trying to address.

Develop fluent preference-aligned models for lower-resource languages
Address lack of native datasets and fluent synthetic data
Enable alignment without instruction-tuning data in target language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training method for lower-resource languages
On-policy training without target-language data
Preserves fluency with disfluent reward models (see the sketch below)
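The disfluent-but-trainable reward model itself can be fit on machine-translated preference pairs with the standard Bradley-Terry pairwise loss. A hedged sketch, assuming scalar scores per response; the random scores are stand-ins for what would, in practice, come from a scoring head on a language model, and the paper's exact reward-model objective may differ.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    The pairs are machine-translated, so the resulting reward model may
    itself be disfluent; the policy still only sees its own on-policy
    generations, which is what preserves fluency."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with random stand-in scores for 16 preference pairs:
chosen = torch.randn(16, requires_grad=True)
rejected = torch.randn(16)
loss = bradley_terry_loss(chosen, rejected)
loss.backward()
```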