Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

πŸ“… 2025-05-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
DRG coding is a high-value yet severely out-of-distribution (OOD) clinical task on which existing large language models (LLMs) perform poorly, because private healthcare and billing data are largely absent from their pretraining corpora. Method: The authors propose DRG-Sapphire, a reinforcement learning (RL) framework for automated DRG coding from clinical notes. Built on Qwen2.5-7B, it is trained with Group Relative Policy Optimization (GRPO) under a rule-based reward grounded in DRG classification guidelines, and it adds a series of RL enhancements for domain-specific challenges absent from prior mathematical reasoning tasks. Contribution/Results: Evaluated on MIMIC-IV, DRG-Sapphire achieves state-of-the-art accuracy and generates physician-validated, interpretable reasoning chains for DRG assignment. Empirical analysis further shows that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, so the scale of SFT before RL critically determines downstream RL effectiveness.
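A minimal sketch of what such a rule-based reward could look like, assuming an R1-style reasoning/answer template with exact-match scoring on the final DRG code. The tag names, partial-credit value, and function signature are illustrative assumptions, not the paper's exact rule set.

```python
import re

def drg_reward(completion: str, gold_drg: str) -> float:
    """Hypothetical rule-based reward for GRPO training on DRG coding.

    Combines a format check (reasoning inside <think> tags, final code
    inside <answer> tags) with an exact-match check on the DRG code.
    """
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed output earns no reward
    predicted = match.group(1).strip().upper()
    # Full reward for the correct code, small credit for valid formatting.
    return 1.0 if predicted == gold_drg.strip().upper() else 0.1
```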

πŸ“ Abstract
Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.
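One compact way to write the scaling trend the abstract reports (the coefficients are illustrative fit parameters, not values taken from the paper):

```latex
% RL accuracy grows roughly linearly in the log of the SFT example count;
% \alpha (intercept) and \beta (slope) would be fit empirically.
\mathrm{Acc}_{\mathrm{RL}}(N_{\mathrm{SFT}}) \approx \alpha + \beta \log N_{\mathrm{SFT}}
```

Read this way, each doubling of SFT data buys a roughly constant accuracy increment, which is the abstract's case for scaling SFT rather than RL alone on knowledge-intensive OOD tasks.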
Problem

Research questions and friction points this paper is trying to address.

Automating labor-intensive DRG code assignment from clinical notes
Addressing LLMs' out-of-distribution reasoning challenges in medical coding
Enhancing explainability and accuracy in domain-specific RL applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large-scale reinforcement learning for DRG coding
Implements Group Relative Policy Optimization (GRPO) with rule-based rewards (see the sketch after this list)
Scales supervised fine-tuning before reinforcement learning
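For reference, the core of GRPO is to standardize each sampled completion's reward against the other completions drawn for the same prompt, replacing a learned value model with group statistics. A minimal sketch, assuming a PyTorch training loop and a group of G completions per prompt:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages as in GRPO.

    `rewards` holds the rule-based scores of the G completions sampled
    for one prompt; each completion's advantage is its reward
    standardized against the group mean and standard deviation.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-8)  # epsilon guards zero-variance groups
```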
πŸ‘₯ Authors
Hanyin Wang
Mayo Clinic Health System, University of Illinois Urbana-Champaign
LLMs for Healthcare
Zhenbang Wu
University of Illinois Urbana-Champaign
G. Kolar
Mayo Clinic Rochester
H. Korsapati
Mayo Clinic Health System
Brian Bartlett
Mayo Clinic Health System
Bryan Hull
Mayo Clinic Phoenix
Jimeng Sun
Professor at University of Illinois Urbana-Champaign
AI for healthcare · Machine learning for healthcare · Deep learning for healthcare