🤖 AI Summary
Existing trajectory-level reinforcement learning struggles to achieve fine-grained turn-level optimization in iterative reasoning tasks, while black-box approaches overlook the reasoning capabilities and prior knowledge of large language models (LLMs). This work proposes TL-GRPO, the first algorithm to introduce turn-level reinforcement learning into LLM-driven iterative optimization. By employing a turn-level grouped sampling mechanism that holds the environment state fixed across multiple interaction rounds, TL-GRPO directly optimizes intermediate steps to maximize the per-turn reward. Built on a lightweight GRPO framework that integrates LLMs with tool calling, the method significantly outperforms standard GRPO and Bayesian optimization on analog circuit sizing tasks, with a 30B-parameter model achieving state-of-the-art performance under identical simulation budgets.
📝 Abstract
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, a setting typically framed as a Markov Decision Process and optimized with trajectory-level reinforcement learning (RL) algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than by cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
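The turn-level group sampling described in the abstract can be sketched in a few lines. The sketch below is illustrative only and is not from the paper: the function names, the toy action/reward stand-ins, and the group size are all assumptions. It shows the key idea, that at each turn the environment state is held fixed while a group of candidate actions is sampled, and advantages are group-normalized per turn (as in GRPO) rather than per trajectory, with the best per-turn reward, not a cumulative return, determining value.

```python
import random
import statistics

def sample_action(state, rng):
    # Stand-in for the LLM proposing one candidate (e.g. a circuit sizing);
    # a real agent would condition a policy model on the fixed state.
    return state + rng.uniform(-1.0, 1.0)

def reward(action):
    # Stand-in for one simulator call scoring a candidate; higher is better.
    return -abs(action - 3.0)

def turn_level_group_step(state, group_size=8, seed=0):
    """One TL-GRPO-style turn: sample a group from a FIXED state,
    then group-normalize rewards into per-turn advantages."""
    rng = random.Random(seed)
    actions = [sample_action(state, rng) for _ in range(group_size)]
    rewards = [reward(a) for a in actions]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    # Group-normalized advantage for each candidate at this single turn.
    advantages = [(r - mu) / sigma for r in rewards]
    # The best candidate seeds the next turn; the trajectory is credited
    # with the best turn-level reward rather than a cumulative return.
    best = max(range(group_size), key=lambda i: rewards[i])
    return actions[best], advantages

next_state, adv = turn_level_group_step(state=0.0)
```

In a full implementation the advantages would weight a clipped policy-gradient update, as in standard GRPO; the sketch isolates only the turn-level sampling and normalization that distinguish TL-GRPO.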