Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit inconsistent outputs under semantically equivalent prompts in mission-critical enterprise domains—such as finance and healthcare—undermining trustworthiness, regulatory compliance, and user experience. Existing mitigation strategies (e.g., RAG, temperature tuning) fail to guarantee output stability across equivalent prompts. This paper introduces Group Relative Policy Optimization (GRPO) for information consistency alignment—the first application of GRPO to this problem. We formalize semantically equivalent prompts as groups and jointly optimize for helpfulness and stability via context resetting and an entropy-driven composite reward. Our method explicitly models and suppresses generative variability through prompt-group modeling and context isolation. Evaluated on investment and job recommendation tasks, it significantly reduces output variation rates, outperforming both fine-tuning and decoding-based baselines. This work achieves a substantive advance in enterprise-grade information delivery stability.
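The entropy-driven composite reward described in the summary can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `helpfulness` scorer, the weight `lam`, and the exact entropy formulation over a prompt group's answers are all assumptions.

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution within one
    group of semantically equivalent prompts. Returns 0.0 when every
    prompt variant yields the same answer; grows as outputs diverge."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def composite_reward(helpfulness, group_answers, lam=1.0):
    """Helpfulness minus an entropy penalty on within-group variability.
    `helpfulness` would come from a task-specific scorer (assumed)."""
    return helpfulness - lam * answer_entropy(group_answers)
```

Under this sketch, a fully consistent group (entropy 0) keeps its full helpfulness score, while divergent answers are penalized in proportion to their spread.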

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios, such as HR onboarding, customer support, or policy disclosure, require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.
Problem

Research questions and friction points this paper is trying to address.

LLMs show inconsistent outputs for semantically equivalent prompts
Existing methods cannot guarantee stability across equivalent prompts
Inconsistency undermines trust in enterprise applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Relative Policy Optimization for consistency
Entropy-based rewards for helpfulness and stability
Reset conversational context to isolate phrasing effects
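The group-relative optimization named above can be sketched as GRPO's core normalization step: each completion's reward is compared against the statistics of its own prompt group rather than a learned value baseline. This is a generic illustration of that step under stated assumptions, not the authors' code.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each completion's reward against its prompt group's
    mean and standard deviation (GRPO's group-relative baseline)."""
    mu = statistics.mean(rewards)
    # When all rewards in the group are identical, pstdev is 0.0;
    # fall back to 1.0 so every advantage is simply 0.
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

In the paper's setting, one "group" would hold the rewards of completions for a set of semantically equivalent prompt variants, so completions that answer consistently and helpfully receive positive advantage relative to their group.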