🤖 AI Summary
This work addresses the distortion in large language model (LLM) benchmark evaluations, where strong performance often stems from familiarity with task formats rather than genuine reasoning capabilities. To mitigate this bias without requiring task-specific training data, the authors propose a two-stage test-time reinforcement learning alignment method: first aligning the model to the expected output format via a single example, then constructing a reward signal through majority voting to guide the model toward the benchmark’s underlying distribution. This approach, the first to apply test-time reinforcement learning for controlling task-format familiarity bias, significantly enhances base models’ zero-shot performance on domain benchmarks and narrows the gap with fine-tuned counterparts. The results suggest that much of the apparent gain from conventional fine-tuning is attributable to format familiarity rather than improved reasoning, enabling a more equitable assessment of a model’s true capabilities.
📝 Abstract
Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised fine-tuning (SFT). However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides an initial alignment of the model to the task format; second, test-time RL with a majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns models about as well as SFT-based train-before-test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models, which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from reinforcement learning with verifiable rewards (RLVR) or SFT reported in the literature do not reflect a difference in reasoning capability, but are rather artifacts of task familiarity.
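To make the majority-voting reward in the second stage concrete, the following is a minimal sketch and not the authors' implementation. It assumes that several answers are sampled per test prompt, that a final answer string has already been extracted from each sample (`None` when extraction fails), and that the reward is binary. The function name `majority_vote_rewards` and all of these details are illustrative assumptions; the abstract only states that the reward comes from majority voting.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote_rewards(
    answers: List[Optional[str]],
) -> Tuple[Optional[str], List[float]]:
    """Compute a pseudo-label and per-sample rewards for one test prompt.

    Hypothetical helper: `answers` holds the extracted final answer of each
    sampled completion, or None if no answer could be parsed.
    """
    # Only parseable answers take part in the vote.
    valid = [a for a in answers if a is not None]
    if not valid:
        # Nothing to vote on: no pseudo-label, zero reward for every sample.
        return None, [0.0] * len(answers)

    # The most frequent answer serves as the pseudo-label. Ties are broken
    # by first occurrence, which is an arbitrary choice made for this sketch.
    pseudo_label, _ = Counter(valid).most_common(1)[0]

    # Assumed binary reward: 1.0 for agreeing with the majority, else 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    sampled = ["42", "41", "42", None, "42"]
    label, rewards = majority_vote_rewards(sampled)
    print(label, rewards)
```

In such a setup the rewards would be passed to a policy-gradient update in place of ground-truth correctness, which is why no labeled training set is needed; the choice of RL algorithm and the number of samples per prompt are left open here.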