TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing trajectory-level reinforcement learning struggles to achieve fine-grained turn-level optimization in iterative reasoning tasks, while black-box approaches overlook the reasoning capabilities and prior knowledge of large language models (LLMs). This work proposes TL-GRPO, the first algorithm to introduce turn-level reinforcement learning into LLM-driven iterative optimization. By employing a turn-level grouped sampling mechanism that maintains a fixed environment state across multiple interaction rounds, TL-GRPO directly optimizes intermediate steps to maximize per-turn reward. Built upon a lightweight GRPO framework and integrating LLMs with tool calling, the method significantly outperforms standard GRPO and Bayesian optimization on analog circuit sizing tasks, with a 30B-parameter model achieving state-of-the-art performance under identical simulation budgets.
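
The turn-level grouped sampling described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: `policy.generate` and `env_state.evaluate` are hypothetical, and continuing the rollout from the best-scoring candidate is an assumption; only the group-standardized advantage follows the standard GRPO formulation.

```python
import numpy as np

def turn_level_grouped_step(policy, env_state, history, group_size=8):
    """One turn of turn-level grouped sampling (illustrative sketch).

    For a fixed environment state, sample a group of candidate actions,
    score each with the per-turn reward, and compute group-relative
    advantages as in GRPO (rewards standardized within the group).
    """
    # Sample G candidate responses from the same state / dialogue history.
    candidates = [policy.generate(env_state, history) for _ in range(group_size)]

    # Score every candidate with the per-turn reward, e.g. a simulator tool call.
    rewards = np.array([env_state.evaluate(c) for c in candidates], dtype=float)

    # Group-relative advantage: standardize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Each (candidate, advantage) pair becomes a turn-level policy-gradient sample.
    samples = list(zip(candidates, advantages))

    # Continue the rollout from the best candidate of this group (assumed rule).
    best = candidates[int(rewards.argmax())]
    new_history = history + [best]
    return samples, new_history
```

Each (candidate, advantage) pair would then feed a clipped policy-gradient update as in GRPO, but computed per turn rather than per trajectory.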

📝 Abstract
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, a setting typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
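
The contrast the abstract draws between cumulative returns and best-turn value can be written compactly as below; the notation ($r_t$ for the turn-$t$ reward, $T$ for the number of turns, $\gamma$ for a discount factor) is ours and not taken from the paper.

```latex
% Standard trajectory-level objective vs. the iterative-optimization view
% described in the abstract (notation assumed, not from the paper).
\[
  R_{\text{cum}}(\tau) \;=\; \sum_{t=1}^{T} \gamma^{t-1}\, r_t
  \qquad\text{vs.}\qquad
  R_{\text{best}}(\tau) \;=\; \max_{1 \le t \le T} r_t
\]
```
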
Problem

Research questions and friction points this paper is trying to address.

iterative optimization
turn-level optimization
reasoning-guided
trajectory-level RL
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Turn-Level RL
Iterative Optimization
TL-GRPO
Reasoning-Guided Optimization
Analog Circuit Sizing
Authors

Peiji Li, Fudan University
Linyang Li, Shanghai AI Laboratory
Handa Sun, Fudan University
Wenjin Mai, Fudan University
Yongkang Chen, Shanghai AI Laboratory
Xiaozhe Li, Shanghai AI Laboratory
Yue Shen, Shanghai AI Laboratory
Yichuan Ma, Fudan University (LLM, Synthetic Data)
Yiliu Sun, Shanghai AI Laboratory
Jiaxi Cao, Shanghai AI Laboratory
Zhishu He, Shanghai AI Laboratory
Bo Wang, Fudan University
Xiaoqing Zheng, Fudan University (Natural Language Processing and Machine Learning)
Zhaori Bi, Fudan University (Analog Circuit Design Automation, Electrical Design Automation, Medical AI)
Xipeng Qiu, Fudan University
Qipeng Guo, Fudan University
Kai Chen, Shanghai AI Laboratory (LLM, VLM, Computer Vision)
Dahua Lin, The Chinese University of Hong Kong (computer vision, machine learning, probabilistic inference, Bayesian nonparametrics)