🤖 AI Summary
This work addresses chest X-ray report generation, where standard reinforcement learning rewards such as exact string matching are ill-suited because medical findings are unordered and orthogonal rather than steps in a reasoning chain. The authors represent each report as an unordered set of sentence embeddings and use set-based distances, notably the Chamfer distance, as permutation-invariant continuous reward signals. Combined with the GRPO algorithm, the same signal serves both post-training and test-time candidate selection. Across two datasets and three vision-language models, the method consistently outperforms supervised fine-tuning and exact-match-based GRPO, with average relative improvements of 6.80%, 7.82%, and 4.45% on BERTScore, RadGraph F1, and CheXbert F1, respectively. A streaming pruning strategy at test time reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection.
📝 Abstract
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance-based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics, with average relative improvements of 6.80\%, 7.82\%, and 4.45\% on BERTScore, RadGraph F1, and CheXbert F1, respectively. The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4\% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-$N$ selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA}{available}.
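To make the set-based reward concrete, the following is a minimal sketch of a Chamfer-style distance between two sets of sentence embeddings, with a best-of-$N$ selector built on top of it. It is an illustration under stated assumptions, not the authors' implementation: the symmetric averaged form, the use of cosine distance, the negated distance as reward, and the function names `chamfer_distance`, `set_reward`, and `best_of_n` are all hypothetical choices. Embeddings are passed in as plain arrays; in practice they would come from the frozen sentence transformer.

```python
import numpy as np


def _normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two embedding sets (one row per sentence).

    Assumption: cosine distance between sentences, averaged in both directions.
    The result does not depend on the order of rows in either set.
    """
    a, b = _normalize(a), _normalize(b)
    cost = 1.0 - a @ b.T                  # pairwise cosine distances, shape (len(a), len(b))
    a_to_b = cost.min(axis=1).mean()      # each sentence in a matched to its nearest in b
    b_to_a = cost.min(axis=0).mean()      # each sentence in b matched to its nearest in a
    return 0.5 * (a_to_b + b_to_a)


def set_reward(generated, reference):
    """Continuous reward: higher when the generated set is closer to the reference set."""
    return -chamfer_distance(generated, reference)


def best_of_n(candidates, reference):
    """Return the index of the candidate set with the smallest distance to the reference set."""
    scores = [chamfer_distance(c, reference) for c in candidates]
    return int(np.argmin(scores))


# Toy example with 3-dimensional "embeddings" standing in for sentence vectors.
reference = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
reordered = reference[::-1]                # same findings, different sentence order
missing_one = np.array([[1.0, 0.0, 0.0]])  # one finding omitted
unrelated = np.array([[0.0, 0.0, 1.0]])    # a finding absent from the reference
```

In this toy setup, reordering the sentences leaves the distance at zero, while omitting a finding or producing an unrelated one is penalized, which is the property that exact-match rewards lack. The sketch assumes no embedding is the zero vector, since normalization would otherwise divide by zero.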