TTRV: Test-Time Reinforcement Learning for Vision Language Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) rely heavily on labeled data and handcrafted reward functions for inference-time optimization, in contrast to how humans learn unsupervised from their environment. To address this, we propose TTRV—the first test-time reinforcement learning framework tailored for VLMs—which operates entirely without annotated data. During inference, TTRV constructs self-supervised reward signals from the frequency distribution of sampled outputs and adds a low-entropy bonus to sharpen the policy distribution. Built on Group Relative Policy Optimization (GRPO), it enables fully unsupervised, online inference-time adaptation. Evaluated across 16 benchmarks, TTRV achieves average improvements of +24.6% in image recognition accuracy and +10.0% in VQA performance, with maximum gains of +52.4% and +29.8%, respectively. Notably, InternVL-8B enhanced by TTRV surpasses GPT-4o by 2.3% on image recognition.

📝 Abstract
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
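The reward design described in the abstract can be sketched as follows: sample several answers for one unlabeled test input, reward each sample by the empirical frequency of its answer (a self-consistency signal), and add a shared bonus for low entropy of the answer distribution. This is a minimal illustration, not the paper's exact formulation; the `alpha` weight and the additive combination of the two terms are assumptions for this sketch.

```python
from collections import Counter
import math

def ttrv_rewards(sampled_answers, alpha=0.1):
    """Frequency-based rewards plus a low-entropy bonus for one group of
    answers sampled from the base model on a single unlabeled test input.

    `alpha` (weight of the entropy bonus) is a hypothetical parameter here.
    """
    n = len(sampled_answers)
    counts = Counter(sampled_answers)
    # Empirical probability of each distinct answer in the group.
    probs = {ans: c / n for ans, c in counts.items()}
    # Shannon entropy of the empirical output distribution; lower is more
    # confident, so the negated entropy acts as a bonus for peaked outputs.
    entropy = -sum(p * math.log(p) for p in probs.values())
    # Each sample's reward: how frequent its answer is, plus the shared
    # low-entropy bonus. GRPO would then normalize these within the group.
    return [probs[ans] - alpha * entropy for ans in sampled_answers]

rewards = ttrv_rewards(["cat", "cat", "dog", "cat"], alpha=0.0)
```

In a full pipeline these group-relative rewards would drive a GRPO update of the VLM's policy at test time, with no labels involved at any step.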
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision language models without labeled training data
Improving object recognition and visual question answering performance
Adapting models during inference using test-time reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses test-time reinforcement learning without labeled data
Rewards based on output frequency and low entropy
Enhances vision language models during inference phase