7B Fully Open Source Moxin-LLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement

📅 2024-12-08
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current open-source large language models (LLMs) suffer from insufficient transparency, poor reproducibility, and security risks—particularly due to the lack of publicly available training code, data, and configurations. To address these challenges, we propose a full-stack open-science paradigm—open methodology, source code, training data, and model weights—and release Moxin-LLM, a fully reproducible, commercially licensable 7B-parameter model. Methodologically, we introduce Group Relative Policy Optimization (GRPO) for the first time to enhance reasoning in small-scale LLMs, and systematically integrate four sequential training stages: pretraining, instruction tuning, chain-of-thought (CoT) distillation, and GRPO-based reinforcement learning, leveraging knowledge transfer from DeepSeek R1 and an efficient RLHF variant. Experiments demonstrate that Moxin Reasoning consistently outperforms comparable open-source 7B models across zero-shot, few-shot, and CoT-based reasoning benchmarks, with substantial gains in logical reasoning—while ensuring end-to-end openness and commercial compliance.

📝 Abstract
Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs such as GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components such as training code and data, which may hinder further innovation on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in adherence to the principles of open science, open source, open data, and open access. We release the pre-training code and configurations, the training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make a continuous commitment to fully open-source LLMs. After pre-training the base model, we fine-tune the Moxin Base model with a SOTA post-training framework and instruction data to obtain the Moxin Instruct model. To improve its reasoning capability, we further fine-tune the Instruct model with chain-of-thought data distilled from DeepSeek R1, and then apply Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm following DeepSeek R1, yielding the Moxin Reasoning model. Experiments show that our models achieve superior performance across zero-shot, few-shot, and CoT evaluations.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of transparency in open-source LLMs by releasing full training data and code.
Enhances LLM reasoning via chain-of-thought fine-tuning and GRPO reinforcement learning.
Demonstrates superior performance in zero-shot, few-shot, and CoT evaluations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully open-source LLM with complete transparency
GRPO-based reinforcement learning enhancement
Chain-of-thought data distillation for reasoning
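The GRPO algorithm named in the bullets above replaces PPO's learned value baseline with statistics computed over a group of sampled responses per prompt. The sketch below illustrates only that group-relative advantage step, not the paper's implementation; the function name and the example rewards are assumptions for illustration.

```python
# Minimal sketch of GRPO's group-relative advantage computation
# (illustrative only; not the paper's actual training code).
import math

def grpo_advantages(rewards):
    """Normalize each reward against its group's mean and std.

    GRPO samples a group of responses for one prompt, scores each with
    a reward model, and uses the group statistics as the baseline
    instead of a learned value function.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0.0:
        # Identical rewards carry no relative learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions scored by a reward model
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline comes from the group itself, no separate critic network is trained, which is part of why GRPO is attractive for enhancing small-scale models.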
👥 Authors
Pu Zhao, Northeastern University
Xuan Shen, Cornell Tech & Northeastern University (Efficient Deep Learning, ML Systems, AutoML)
Zhenglun Kong, Harvard University (Efficient Deep Learning, Large Language Models, AI4Science)
Yixin Shen, Inria Rennes (Quantum Algorithms, Cryptography)
Sung-En Chang, Northeastern University (Model Compression, Machine Learning, Deep Learning, Quantization, Efficient Training)
Timothy Rupprecht, Northeastern University
Lei Lu, Northeastern University
Enfu Nan, Northeastern University
Changdi Yang, PhD candidate, Northeastern University & Snap Inc. (Efficient Deep Learning)
Yumei He, Tulane University
Xingchen Xu, University of Washington
Yu Huang, Roboraction.ai
Wei Wang, Futurewei Technologies
Yue Chen, Futurewei Technologies
Yong He, Futurewei Technologies
Yanzhi Wang, Northeastern University & AIBAO LLC