🤖 AI Summary
This work addresses the high cost and reliance on expert labor involved in constructing high-quality rubrics for evaluating large language models in healthcare, which hinders scalable and safe model development. To overcome this limitation, we propose Health-SCORE, a novel framework that enables low-cost, scalable automatic generation of evaluation rubrics. Health-SCORE unifies rubric application across three uses: structured assessment, safety-aware reward modeling for reinforcement learning, and in-context learning prompt design. Experimental results demonstrate that Health-SCORE achieves evaluation quality comparable to human-crafted rubrics across multiple open-ended medical tasks, significantly enhancing the scalability and safety of training and evaluating health-focused large language models.
📝 Abstract
Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality, domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can serve as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while requiring significantly less development effort, making rubric-based evaluation and training more scalable.