DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reward models are prone to relying on superficial shortcut cues during training, leading to fragile preference modeling. To address this issue, this work proposes DynaCF, a novel framework that introduces, for the first time, a dynamic counterfactual sensitivity measurement mechanism. During training, DynaCF continuously evaluates the sensitivity of each sample pair to counterfactual perturbations that preserve semantic content and dynamically down-weights highly sensitive samples in the Bradley–Terry objective. This approach enables online identification and suppression of shortcut learning by integrating counterfactual perturbation generation, dynamic reweighting, and real-time sensitivity tracking. Extensive experiments demonstrate that DynaCF significantly enhances the robustness of reward models and effectively mitigates their dependence on spurious correlations.
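As a rough illustration of what down-weighting in the Bradley–Terry objective could look like, one possible form is written out below. The notation is an assumption made for this sketch and is not taken from the paper: r_θ is the reward model, (x_i, y_i⁺, y_i⁻) is a prompt with its chosen and rejected responses, s_i is the measured counterfactual sensitivity of pair i, and g is some non-increasing weighting function.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{\sum_{i} w_i} \sum_{i} w_i \,
    \log \sigma\!\big( r_\theta(x_i, y_i^{+}) - r_\theta(x_i, y_i^{-}) \big),
\qquad
w_i = g(s_i), \quad g \text{ non-increasing in } s_i .
```

With w_i = 1 for every pair this reduces to the standard Bradley–Terry negative log-likelihood. Because s_i is measured under the current model, the weights would be recomputed as training progresses.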

📝 Abstract

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
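The online measurement described above can be sketched in a few lines of Python. This is a minimal toy, not the authors' implementation: the function names, the way margin shift and preference flips are combined into one score (`flip_bonus`), and the exponential weighting with a `temperature` are all assumptions chosen for illustration. The reward model is any callable from text to a float, and the perturbations are any callables from text to text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Reward = Callable[[str], float]
Perturb = Callable[[str], str]


def bt_loss(margin: float) -> float:
    """Bradley-Terry negative log-likelihood, -log(sigmoid(margin)), computed stably."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def sensitivity(reward: Reward, chosen: str, rejected: str,
                perturb_fns: Sequence[Perturb], flip_bonus: float = 1.0) -> float:
    """Average margin shift (plus a penalty for preference flips) under perturbation.

    ASSUMPTION: combining the two signals additively is a choice made for this sketch.
    """
    base = reward(chosen) - reward(rejected)
    scores: List[float] = []
    for perturb in perturb_fns:
        margin = reward(perturb(chosen)) - reward(perturb(rejected))
        shift = abs(margin - base)
        flipped = 1.0 if (margin > 0) != (base > 0) else 0.0
        scores.append(shift + flip_bonus * flipped)
    return sum(scores) / len(scores) if scores else 0.0


def sample_weight(score: float, temperature: float = 1.0) -> float:
    """ASSUMPTION: exponential decay, so more sensitive pairs get smaller weights."""
    return math.exp(-score / temperature)


def weighted_bt_loss(reward: Reward, pairs: Sequence[Tuple[str, str]],
                     perturb_fns: Sequence[Perturb]) -> float:
    """Weight-normalized Bradley-Terry loss over (chosen, rejected) pairs.

    In gradient-based training the weights would be treated as constants
    (no gradient flows through the sensitivity measurement).
    """
    weights = [sample_weight(sensitivity(reward, c, r, perturb_fns)) for c, r in pairs]
    losses = [bt_loss(reward(c) - reward(r)) for c, r in pairs]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total


# Toy usage: a length-based reward (a shortcut) versus a content-based reward.
def length_reward(text: str) -> float:
    return 0.1 * len(text.split())


def content_reward(text: str) -> float:
    return 1.0 if "clear" in text else 0.0


def drop_filler(text: str) -> str:
    # Hypothetical semantics-preserving perturbation: remove the filler word "very".
    return text.replace(" very", "")


pair = ("the answer is very very very clear", "no")
s_length = sensitivity(length_reward, pair[0], pair[1], [drop_filler])
s_content = sensitivity(content_reward, pair[0], pair[1], [drop_filler])
```

In this toy, the length-based reward changes its margin when filler words are removed, so the pair receives a weight below 1, while the content-based reward is unaffected and the pair keeps full weight.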

Problem

Research questions and friction points this paper is trying to address.

shortcut learning

reward models

preference modeling

counterfactual sensitivity

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

shortcut learning

counterfactual perturbation

dynamic reweighting