ElChat: Adapting Chat Language Models Using Only Target Unlabeled Language Data

📅 2024-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of labeled chat data for low-resource languages and the difficulty of adapting multilingual large language models (LLMs) to them, this paper proposes ElChat, a direct adaptation method that requires no base model and no target-language labeled data. Rather than first adapting a base model and then restoring chat behavior, ElChat fine-tunes a chat model directly on unlabeled target-language corpora and elicits chat abilities by injecting information from the source chat model, which mitigates forgetting of instruction-following capabilities. Experiments show the approach delivers more robust and competitive target-language and safety performance while outperforming the chat vector (CV) baseline on English retention, chat, and instruction-following abilities.

📝 Abstract
Vocabulary expansion (VE) is the de-facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While ideal, target chat data is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.
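The chat vector (CV) baseline described above can be sketched as simple weight arithmetic: the chat vector is the weight difference between the source chat and base models, added to the target-adapted base model. The sketch below is illustrative only; weights are toy dicts of floats rather than real tensors, and all function names are hypothetical.

```python
# Illustrative sketch of the chat vector (CV) baseline from the abstract.
# Weights are represented as plain dicts of floats; in practice they are
# per-parameter tensors. All names here are hypothetical.

def chat_vector(source_base, source_chat):
    """Chat vector = weight difference between source chat and base models."""
    return {k: source_chat[k] - source_base[k] for k in source_base}

def apply_chat_vector(adapted_base, cv):
    """Add the chat vector to the target-adapted base model's weights."""
    return {k: adapted_base[k] + cv[k] for k in adapted_base}

# Toy one-parameter "models":
source_base = {"w": 1.0}
source_chat = {"w": 1.5}   # chat tuning shifted w by +0.5
adapted_base = {"w": 2.0}  # base model after adaptation on target unlabeled data

cv = chat_vector(source_base, source_chat)
target_chat = apply_chat_vector(adapted_base, cv)
print(target_chat)  # → {'w': 2.5}
```

Note that this pipeline needs both a base and a chat model from the same family, which is exactly the requirement ElChat removes.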
Problem

Research questions and friction points this paper is trying to address.

Adapting chat LLMs to target languages without labeled data
Preventing chat ability loss during language adaptation
Improving performance in low-resource language scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directly adapts chat models using unlabeled data
Injects information from the source chat model to elicit chat abilities
Achieves superior English, chat, and instruction-following performance without a base model
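One way to picture "injecting information from the source chat model" is as interpolation between the adapted weights and the source chat weights. The abstract does not specify ElChat's actual mechanism, so the sketch below is a hypothetical illustration of the general idea, not the paper's method; the function name and the mixing coefficient `alpha` are assumptions.

```python
# Hypothetical sketch: restore chat ability by interpolating the
# target-adapted chat model's weights with the source chat model's weights.
# This is NOT necessarily ElChat's actual mechanism; alpha is an assumption.

def inject_chat(adapted_chat, source_chat, alpha=0.5):
    """Blend adapted and source chat weights; alpha controls how much
    source chat information is injected back."""
    return {k: (1 - alpha) * adapted_chat[k] + alpha * source_chat[k]
            for k in adapted_chat}

# Toy example: adaptation drifted w to 2.0; injection pulls it back toward
# the source chat model's 1.0.
merged = inject_chat({"w": 2.0}, {"w": 1.0}, alpha=0.5)
print(merged)  # → {'w': 1.5}
```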