The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations of vision-language models do not systematically test fine-grained visual perception, so the smallest visual pattern these models can reliably recognize remains unknown. To address this gap, this work introduces FineSightBench, a benchmark that isolates and quantifies pixel-level perception and small-scale reasoning across a controlled range of 4–48 pixels. The benchmark uses synthetically generated images at precise scales and evaluates prominent models on pixel-level identification, spatial reasoning, and counting tasks, complemented by a failure mode analysis. Perceptual capability saturates around 12 pixels, whereas reasoning remains limited even at larger scales and frequently produces numerical and sequential errors. This gap indicates a clear dissociation between perception and reasoning in current vision-language models.
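To make "synthetically generated images at precise scales" concrete, here is a minimal sketch of a scale-controlled stimulus generator. It is not the benchmark's actual pipeline: the function name `make_stimulus`, the 224-pixel canvas, the sampled scale list, and the choice of a plain black square as the target are all assumptions made for illustration.

```python
import random

# Hypothetical sample of target sizes within the 4-48 px range named in the summary.
SCALES_PX = [4, 8, 12, 16, 24, 32, 48]


def make_stimulus(scale_px, canvas_px=224, seed=0):
    """Return (image, bbox) for a white canvas holding one black square.

    image is a list of rows of grayscale values (0 = black, 255 = white).
    bbox is (x0, y0, x1, y1) with exclusive upper bounds, so the target is
    exactly scale_px pixels wide and tall.
    """
    if not 0 < scale_px <= canvas_px:
        raise ValueError("scale_px must be positive and fit inside the canvas")
    rng = random.Random(seed)  # seeded so every stimulus is reproducible
    # Pick a top-left corner that keeps the whole square inside the canvas.
    x0 = rng.randint(0, canvas_px - scale_px)
    y0 = rng.randint(0, canvas_px - scale_px)
    image = [[255] * canvas_px for _ in range(canvas_px)]
    for y in range(y0, y0 + scale_px):
        for x in range(x0, x0 + scale_px):
            image[y][x] = 0
    return image, (x0, y0, x0 + scale_px, y0 + scale_px)


# One stimulus per scale; the seed varies so positions differ across scales.
stimuli = {s: make_stimulus(s, seed=s) for s in SCALES_PX}
```

Because the target size is set in pixels rather than inherited from a natural photograph, any drop in accuracy can be attributed to scale alone, which is the kind of control the summary describes.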

📝 Abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? To answer this question, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4–48 px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12 px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.
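The dissociation described above comes from comparing accuracy per target scale across the two task families. The sketch below shows one way such a comparison could be computed. It is an assumption-laden illustration, not the paper's analysis code: the record layout `(task_family, scale_px, is_correct)`, the function names, and the 0.9 threshold used to call a curve "saturated" are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_scale(records):
    """Aggregate (task_family, scale_px, is_correct) records.

    Returns a dict mapping (task_family, scale_px) to accuracy in [0, 1].
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for family, scale_px, is_correct in records:
        cell = totals[(family, scale_px)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def saturation_scale(acc, family, threshold=0.9):
    """Smallest scale from which accuracy stays at or above the threshold.

    Walks from the largest scale downward and stops at the first scale that
    falls below the threshold. Returns None if even the largest scale fails.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    scales = sorted(scale for fam, scale in acc if fam == family)
    result = None
    for scale in reversed(scales):
        if acc[(family, scale)] >= threshold:
            result = scale
        else:
            break
    return result
```

Under this definition, a perception curve that reaches the threshold at a small scale while the reasoning curve never does would reproduce the qualitative pattern reported in the abstract.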

Problem

Research questions and friction points this paper is trying to address.

fine-grained visual perception

vision-language models

pixel-level recognition

visual reasoning

small-scale visual patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained perception

vision-language models

FineSightBench