🤖 AI Summary
To address the limited multilingual capabilities of many open-source large language models (LLMs) and the constraints imposed by closed ecosystems, this work introduces Baichuan 2, a series of open-source multilingual LLMs with 7B and 13B parameters, trained from scratch on a high-quality corpus of 2.6 trillion tokens to natively support both Chinese and English. Baichuan 2 matches or outperforms open-source models of similar size, such as Llama 2 and ChatGLM2, on public benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, and shows strong performance in vertical domains such as medicine and law. All pretraining checkpoints are publicly released to support reproducible research and analysis of training dynamics.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to help the research community better understand the training dynamics of Baichuan 2.