🤖 AI Summary
To address the scarcity and suboptimal performance of Arabic NLP models, this paper proposes an efficient "translate-and-tune" paradigm. First, a powerful multilingual teacher model—accelerated via FP8 quantization—is used to generate high-fidelity Arabic–English bilingual supervision signals via knowledge distillation. Second, a lightweight student model trained on this data translates high-quality English instruction data into Arabic, yielding a curated Arabic instruction dataset. Finally, SLERP-based model merging harmonizes specialized Arabic capabilities with general foundational competence. Evaluated across model scales from 350M to 9B parameters, the approach achieves significant gains in Arabic understanding and generation, consistently outperforming baselines on major Arabic benchmarks. To foster community advancement, the authors publicly release all trained models, the Arabic instruction dataset, and training code, supporting the development of Arabic NLP research and applications.
📝 Abstract
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR$\leftrightarrow$EN teacher to FP8 (yielding $\sim$2$\times$ higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply SLERP merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" ($\leq$2B) and "small" (7–9B) categories, outperforming their base models. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
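The SLERP merge mentioned above interpolates between two sets of model weights along the arc of a hypersphere rather than along a straight line, which tends to preserve the geometry of both checkpoints better than plain averaging. The sketch below illustrates the idea with NumPy; the function names, the per-tensor flattening, and the interpolation factor `t=0.5` are illustrative assumptions, not the paper's actual merge configuration.

```python
import numpy as np

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors.

    Falls back to linear interpolation when the vectors are nearly
    colinear, where the spherical formula is numerically unstable.
    """
    v0n = v0 / (np.linalg.norm(v0) + eps)
    v1n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    omega = np.arccos(dot)          # angle between the two weight vectors
    if omega < eps:                 # nearly parallel: use LERP instead
        return (1 - t) * v0 + t * v1
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * v0 + (np.sin(t * omega) / so) * v1

def merge_models(base_state, tuned_state, t=0.5):
    """Merge two state dicts tensor-by-tensor via SLERP (hypothetical helper)."""
    return {
        name: slerp(base_state[name].ravel(),
                    tuned_state[name].ravel(), t).reshape(base_state[name].shape)
        for name in base_state
    }
```

In practice, merges like this are applied per parameter tensor; `t` trades off the base model's general ability (`t=0`) against the Arabic-specialized checkpoint (`t=1`).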