Post-Completion Learning for Language Models

📅 2025-07-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional language models terminate learning upon generating the <eos> token, leaving the post-<eos> sequence space underutilized. Method: This paper proposes Post-Completion Learning (PCL), a white-box reinforcement learning framework that systematically exploits positions beyond <eos> to jointly optimize reasoning and self-assessment. It integrates dual-track supervised fine-tuning (SFT), interpretable reward prediction, and multi-objective reward alignment, without increasing inference latency. Contribution/Results: Evaluated across multiple benchmarks, the approach consistently outperforms standard SFT and black-box RL methods, improving factual consistency, logical rigor, and self-evaluation accuracy. It establishes a new paradigm for post-training large language models by transforming otherwise idle post-<eos> tokens into structured learning signals.

📝 Abstract
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates its output content according to the reward rules, and the resulting score is then aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
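The core sequence layout described in the abstract — supervise a post-completion segment during training, but stop at <eos> at inference time — can be sketched roughly as follows. All names here (`build_training_sequence`, the `[eval]`/`[score]` markers) are illustrative assumptions, not the paper's actual format.

```python
# Hedged sketch of the PCL idea, assuming a simple text layout:
# the training target ends the visible completion with <eos>, then
# appends a self-evaluation and predicted reward that receive loss
# only during training. Marker tokens below are hypothetical.

EOS = "<eos>"

def build_training_sequence(reasoning: str, answer: str,
                            self_eval: str, reward_pred: float) -> str:
    """Training target: completion + <eos> + post-completion segment
    (self-assessment and reward prediction), supervised in training."""
    completion = f"{reasoning} {answer} {EOS}"
    post_completion = f" [eval] {self_eval} [score] {reward_pred:.2f}"
    return completion + post_completion

def inference_output(generated: str) -> str:
    """At inference, generation halts at <eos>, so the post-completion
    segment is never produced and adds no latency."""
    return generated.split(EOS)[0].strip()

seq = build_training_sequence("Think step by step.", "Answer: 42",
                              "factually consistent", 0.9)
print(inference_output(seq))  # "Think step by step. Answer: 42"
```

The point of the split is that the extra supervision signal lives entirely behind the stopping token, which is how the paper claims to improve quality without changing deployment cost.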
Problem

Research questions and friction points this paper is trying to address.

Enhances reasoning and self-evaluation in language models
Utilizes post-completion space for continued learning
Improves output quality while maintaining deployment efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Completion Learning extends training beyond <eos>
White-box RL aligns self-assessment with reward functions
Dual-track SFT and RL hybrid optimize reasoning and evaluation
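The white-box alignment bullet above can be illustrated with a minimal sketch: the model's self-predicted score is supervised toward the score that the external reward rules actually assign. The toy exact-match reward and the squared-error alignment loss are assumptions for illustration, not the paper's actual reward functions.

```python
# Hedged sketch of white-box reward alignment: minimize the gap
# between the model's self-assessed score and a rule-based reward.
# Both functions below are illustrative stand-ins.

def rule_based_reward(answer: str, reference: str) -> float:
    """Toy reward rule: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def alignment_loss(predicted_score: float,
                   answer: str, reference: str) -> float:
    """Squared error between the self-predicted score and the reward
    the rules assign; minimizing it calibrates self-evaluation."""
    target = rule_based_reward(answer, reference)
    return (predicted_score - target) ** 2

print(alignment_loss(0.9, "42", "42"))  # small: self-assessment is calibrated
print(alignment_loss(0.9, "41", "42"))  # large: the model is overconfident
```

Because the reward rules are explicit rather than a learned black-box critic, the supervision signal for self-evaluation stays interpretable, which matches the paper's "white-box" framing.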
Xiang Fei
ByteDance
Siqi Wang
ByteDance
Shu Wei
ByteDance
Yuxiang Nie
Hong Kong University of Science and Technology
Natural Language Processing, Multi-modal Learning, Medical Image Analysis
Wei Shi
ByteDance
Hao Feng
ByteDance
Can Huang
ByteDance