🤖 AI Summary
Existing large vision-language models (LVLMs) struggle to leverage textual knowledge for distinguishing visually similar categories in fine-grained image recognition, while reinforcement learning–based fine-tuning with exact-match rewards tends to induce memorization and impair generalization. To address this, we propose DiVE-k, a novel framework that automatically constructs multiple-choice reasoning tasks from the model's own top-k predictions, enabling differential visual reasoning without external annotations. DiVE-k introduces a verifiable reward mechanism that explicitly encourages discrimination of subtle visual differences, circumventing the memorization bias inherent in conventional RL approaches. Evaluated on five standard fine-grained benchmarks, DiVE-k significantly outperforms Qwen2.5-VL-7B and ViRFT, achieving absolute improvements of 10.04% and 6.16%, respectively, on base-to-novel harmonic-mean accuracy. Moreover, it demonstrates robust performance under few-shot and cross-domain settings.
📝 Abstract
Large Vision-Language Models (LVLMs) possess extensive textual knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed to generalize to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses Qwen2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
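To make the core loop concrete, here is a minimal sketch of the two pieces the abstract describes: building a multiple-choice question from the model's own top-k label predictions, and scoring the answer with a simple verifiable (exact-match on the chosen option) reward. This is an illustration based only on the abstract; the function names, prompt wording, and the rule for injecting the ground-truth label are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of DiVE-k's task construction and reward signal.
# All names and details here are assumptions inferred from the abstract.
import random


def build_mcq(topk_predictions, ground_truth):
    """Turn the model's own top-k labels into a multiple-choice question.

    If the ground-truth label is missing from the top-k list, it replaces
    the last option, so exactly one option is always correct.
    """
    options = list(dict.fromkeys(topk_predictions))  # dedupe, keep order
    if ground_truth not in options:
        options[-1] = ground_truth
    random.shuffle(options)  # avoid positional bias in training
    letters = "ABCDEFGH"[: len(options)]
    prompt = "Which category best matches the image?\n" + "\n".join(
        f"{letter}. {name}" for letter, name in zip(letters, options)
    )
    answer = letters[options.index(ground_truth)]
    return prompt, answer


def verifiable_reward(model_choice, answer):
    """Binary reward for RL: 1.0 iff the selected option letter is correct."""
    return 1.0 if model_choice.strip().upper() == answer else 0.0
```

Because the distractors are the model's own near-miss predictions, every option is plausible by construction, which is what forces differential reasoning rather than rote recall of category names.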