Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

📅 2026-06-08

🤖 AI Summary

This work addresses two limitations of existing text-to-image reward models: they fail to capture the uncertainty and fine-grained nuances of subjective visual preferences, and reasoning-based reward methods are costly to deploy. The authors propose Z-Reward, a teacher-student framework that decouples reasoning-heavy judgment from efficient deployment. The teacher, a large vision-language model, uses reasoning to infer rubric-aligned score distributions. The student internalizes this capability through distillation and produces scores without generating explicit reasoning chains at inference time. The teacher is trained with Group-wise Direct Score Optimization (GDSO) and the student with Reasoning-Internalized Score Distillation (RISD). On the authors' internally annotated evaluation set, the 27B-parameter teacher achieves 89.6% human preference accuracy and the 9B-parameter student reaches 88.6%. Used as a reward signal for text-to-image optimization, Z-Reward yields a 41.3% net human-preference improvement over the SFT baseline.
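To illustrate why a score distribution can carry more information than a scalar, the following minimal sketch compares two hypothetical distributions over a five-point rubric. The rubric values, the example probabilities, and the function names are assumptions made for illustration and do not come from the paper.

```python
# Illustrative sketch only: the 1-5 rubric and the probabilities below are
# hypothetical and are not taken from the paper.

SCORES = [1, 2, 3, 4, 5]  # assumed rubric score values


def expected_score(probs, scores=SCORES):
    """Collapse a score distribution into a single scalar (its mean)."""
    return sum(p * s for p, s in zip(probs, scores))


def score_variance(probs, scores=SCORES):
    """Spread of the distribution around its mean; a scalar reward discards this."""
    mean = expected_score(probs, scores)
    return sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))


# Raters agree the image is average.
consensus = [0.0, 0.0, 1.0, 0.0, 0.0]
# Raters are split between "very bad" and "very good".
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]

for name, dist in [("consensus", consensus), ("polarized", polarized)]:
    print(name, expected_score(dist), score_variance(dist))
```

Both distributions have the same expected score, so a scalar reward would treat them identically, while the variance separates the consensus case from the polarized one.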

📝 Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
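The abstract does not state the exact RISD objective. As a rough sketch of what distilling a score distribution can look like, the example below computes the KL divergence between a teacher distribution and a student distribution over rubric scores, a common choice for distribution matching. The choice of KL, the function name, and the example values are assumptions, not the authors' method.

```python
import math

# Illustrative sketch only: KL divergence is an assumed stand-in for a
# distribution-matching distillation loss; the paper's RISD objective
# may differ.


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over a discrete set of rubric scores."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0  # terms with zero teacher mass contribute nothing
    )


# Hypothetical teacher distribution over scores 1..5, produced after reasoning.
teacher_dist = [0.05, 0.10, 0.20, 0.40, 0.25]
# Hypothetical student predictions made without a reasoning chain.
student_close = [0.05, 0.10, 0.25, 0.35, 0.25]
student_far = [0.40, 0.30, 0.15, 0.10, 0.05]

print(kl_divergence(teacher_dist, student_close))
print(kl_divergence(teacher_dist, student_far))
```

A student whose distribution sits closer to the teacher's incurs a smaller loss, which is the signal a distillation step would minimize.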

Problem

Research questions and friction points this paper is trying to address.

reward modeling

text-to-image generation

subjective preference

score distribution

reasoning-based reward

Innovation

Methods, ideas, or system contributions that make the work stand out.

score distribution

reward modeling

reasoning distillation