Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to training LLM caregiver agents for dementia care: sparse trajectory-level rewards, which hinder fine-grained credit assignment, and external LLM-based evaluation, which is costly and can be inaccurate. To this end, the authors propose the T²-GRPO framework, which decouples reinforcement learning into turn-level and trajectory-level reward horizons. It uses a frozen patient simulator to generate dense turn-level rewards from environment state transitions, and it enforces safety on caregiver actions through a binary hard veto. The heterogeneous rewards are combined via independent centered-rank normalization, which mitigates reward collapse and lets immediate patient feedback and long-term care objectives be optimized jointly. Experimental results show that T²-GRPO outperforms existing baselines across multiple metrics, balancing efficacy, responsiveness, and safety.
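As a rough illustration of what a reward derived from environment state transitions could look like, the sketch below scores each turn by how much the simulated patient's distress and resistance dropped. Everything here is an assumption made for illustration: the `PatientState` fields, their numeric scale, the weights, and the simple weighted-difference form are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PatientState:
    """Hypothetical simulator state; higher values mean a worse state."""
    distress: float
    resistance: float


def turn_reward(prev: PatientState, curr: PatientState,
                w_distress: float = 1.0, w_resistance: float = 1.0) -> float:
    """Reward one caregiver turn by the reduction in distress and resistance.

    Positive when the patient state improved, negative when it worsened.
    The weights are illustrative assumptions.
    """
    return (w_distress * (prev.distress - curr.distress)
            + w_resistance * (prev.resistance - curr.resistance))


def turn_rewards(states: List[PatientState]) -> List[float]:
    """One reward per transition, so n states yield n - 1 rewards."""
    return [turn_reward(a, b) for a, b in zip(states, states[1:])]
```

Because every transition yields a reward, the signal is dense: a dialogue with n observed states produces n - 1 turn-level rewards without any call to an external evaluator.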

📝 Abstract

Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory-level rewards are too sparse for turn-level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose Turn-Trajectory Group Relative Policy Optimization (T$^{2}$-GRPO), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. T$^{2}$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregiving show that T$^{2}$-GRPO outperforms competitive baselines, indicating a substantial improvement in emotionally sensitive caregiver scenarios, where the agent must handle immediate patient feedback, long-term care outcomes, and safety constraints together.
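The combination step can be sketched as follows. This is a minimal illustration and not the paper's formulation: it assumes one scalar turn-level score and one trajectory-level score per rollout in a group, uses the common centered-rank transform (ranks rescaled to [-0.5, 0.5], ties sharing their average rank), and applies the veto by overwriting the combined signal with a fixed penalty. The function names, the additive combination, and `veto_value` are all hypothetical.

```python
from typing import List, Sequence


def centered_ranks(values: Sequence[float]) -> List[float]:
    """Map values to ranks rescaled to [-0.5, 0.5]; ties get the average rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]


def combine(turn_scores: Sequence[float], traj_scores: Sequence[float],
            safe: Sequence[bool], veto_value: float = -1.0) -> List[float]:
    """Normalize each horizon independently, sum, then apply a hard veto.

    Assumption: an unsafe rollout receives veto_value regardless of its
    other scores; the paper may apply its veto differently.
    """
    t = centered_ranks(turn_scores)
    g = centered_ranks(traj_scores)
    return [ti + gi if ok else veto_value for ti, gi, ok in zip(t, g, safe)]
```

Normalizing each horizon separately means a trajectory score on a 0 to 100 scale cannot drown out a turn score on a 0 to 3 scale, since only within-group ordering survives the transform. For example, `combine([1.0, 3.0, 2.0], [100.0, 0.0, 50.0], [True, True, False])` returns `[0.0, 0.0, -1.0]`: the first two rollouts have opposite orderings on the two horizons and cancel out, and the third is vetoed.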

Problem

Research questions and friction points this paper is trying to address.

caregiver agents

reward sparsity

credit assignment

patient distress

long-horizon tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

T²-GRPO

turn-trajectory reward decoupling

environment-grounded rewards