Reinforced Language Models for Sequential Decision Making

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance limitations of small language models in multi-step sequential decision-making tasks, which stem primarily from the difficulty of credit assignment, this paper proposes a reinforcement-learning post-training framework tailored to agent-centric decision making. The core method, Multi-Step Group-Relative Policy Optimization (MS-GRPO), explicitly models inter-step action dependencies and mitigates credit-assignment bias under sparse rewards via an absolute-advantage-weighted episode sampling strategy. The approach formalizes tasks as Text-Mediated Stochastic Games and integrates a language-agent policy architecture with a reward-attribution mechanism. Empirical evaluation on Snake and Frozen Lake benchmarks shows that a 3B-parameter model post-trained with MS-GRPO outperforms a 72B-parameter baseline by 50% on Frozen Lake, underscoring substantial gains in both the effectiveness and the efficiency of small language models on complex sequential decision tasks.

📝 Abstract
Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
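The abstract's credit-assignment rule (attribute the entire cumulative episode reward to each individual step, with advantages computed group-relatively across episodes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the normalization choice and function names here are assumptions.

```python
import statistics

def ms_grpo_advantages(episode_rewards, episode_lengths):
    """Sketch of MS-GRPO-style credit assignment.

    Each episode's total reward is normalized against the group
    (mean/std, in the GRPO tradition), and the resulting episode-level
    advantage is then attributed to every step of that episode.
    """
    mean = statistics.mean(episode_rewards)
    std = statistics.pstdev(episode_rewards) or 1.0  # guard against zero variance
    per_episode = [(r - mean) / std for r in episode_rewards]
    # Broadcast the episode-level advantage to each of its steps.
    return [[a] * n for a, n in zip(per_episode, episode_lengths)]
```

For example, in a group of two episodes with sparse terminal rewards 1.0 and 0.0, every step of the successful episode receives the same positive advantage, and every step of the failed episode the same negative one.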
Problem

Research questions and friction points this paper is trying to address.

Improving small LLMs for sequential decision-making tasks
Addressing credit assignment in multi-step agentic tasks
Reducing reliance on large, computationally expensive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MS-GRPO algorithm for multi-step credit assignment
Uses absolute-advantage-weighted episode sampling strategy
Enables a post-trained 3B-parameter model to outperform a 72B-parameter baseline
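The absolute-advantage-weighted episode sampling idea can be illustrated with a minimal sketch: episodes whose advantage has large magnitude (a strong positive or negative learning signal) are drawn more often for training. The proportional-to-|advantage| weighting and the function name are assumptions for illustration; the paper's exact scheme may differ.

```python
import random

def sample_episodes_by_abs_advantage(episodes, advantages, k, rng=None):
    """Draw k episodes with probability proportional to |advantage|.

    Episodes with near-zero advantage carry little gradient signal
    under group-relative updates, so they are sampled less often.
    """
    rng = rng or random.Random()
    weights = [abs(a) for a in advantages]
    if not any(weights):  # all-zero advantages: fall back to uniform
        weights = [1.0] * len(episodes)
    return rng.choices(episodes, weights=weights, k=k)
```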