🤖 AI Summary
To address the limited multilingual capabilities of many open-source large language models (LLMs) and the constraints imposed by closed ecosystems, this work introduces Baichuan 2, a series of open-source multilingual LLMs with 7B and 13B parameters, trained from scratch on a high-quality corpus of 2.6 trillion tokens to natively support both Chinese and English. Baichuan 2 matches or outperforms open-source models of similar size, such as Llama 2 and ChatGLM2, on public benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, and shows strong performance in vertical domains such as medicine and law. All pretraining checkpoints are publicly released to support reproducible research and analysis of training dynamics.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to help the research community better understand the training dynamics of Baichuan 2.