🤖 AI Summary
This work addresses the high latency, computational redundancy, and performance degradation that large language models (LLMs) suffer on tool-use tasks when complex business rules must be embedded in the prompt in their entirety. To overcome this, the authors propose a multi-stage alignment framework that uses chain-of-thought reasoning so the model can actively recall and apply relevant policies at inference time without the full rule context. The approach introduces a PolicyRecall reward based on Jaccard similarity together with a hallucination penalty, optimized jointly via GRPO reinforcement learning to unify policy awareness with efficient reasoning. Experimental results show that the method cuts input length by 40% while improving performance by 16 points over the baseline and outperforming same-scale in-context policy-embedding models by 3 points.
📝 Abstract
Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle to adhere to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. These lengthy prompts also harm overall performance due to the "needle-in-a-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in context. We further introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
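The paper does not spell out the reward computation, but the combination of a Jaccard-based PolicyRecall reward and a hallucination penalty could look roughly like the sketch below. The function name, the per-policy-ID formulation, and the `penalty` weight are illustrative assumptions, not the authors' implementation.

```python
def policy_recall_reward(predicted_policies, gold_policies, catalog, penalty=0.5):
    """Hedged sketch of a Jaccard-based PolicyRecall reward with a hallucination penalty.

    predicted_policies: policy IDs the model cites in its chain of thought.
    gold_policies: policy IDs annotated as relevant for the query.
    catalog: all valid policy IDs; anything cited outside it counts as hallucinated.
    penalty: assumed weight for each hallucinated policy (not specified in the paper).
    """
    predicted, gold = set(predicted_policies), set(gold_policies)

    # Jaccard similarity between cited and gold policy sets (1.0 if both are empty).
    union = predicted | gold
    jaccard = len(predicted & gold) / len(union) if union else 1.0

    # Subtract a penalty for each cited policy that does not exist in the catalog.
    hallucinated = predicted - set(catalog)
    return jaccard - penalty * len(hallucinated)
```

In a GRPO setup, a scalar like this would be computed per sampled rollout and combined with the task reward; the exact weighting between the two terms is not described in the abstract.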