Stateful Visual Encoders for Vision-Language Models

📅 2026-06-03

🤖 AI Summary

This work addresses a limitation of current open-weight vision-language models (VLMs): their visual encoders are stateless, so each image is encoded independently and subtle changes across images may be attenuated before the language model can compare them. The paper introduces a Stateful Visual Encoder that conditions the encoding of the current image on prior visual features, giving the model cross-image context instead of encoding each image in isolation. Under supervised fine-tuning, VLMs with stateful encoders show consistent gains on cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and these gains hold across input resolutions, language model sizes, and VLM backbones. On real-world tasks (longitudinal radiology, fine-grained image comparison, and remote sensing), stateful encoders consistently improve on generalist VLM baselines and can match or surpass specialized models in selected domains.

📝 Abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
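
The abstract does not spell out how the encoder conditions on prior features, so the following is only a minimal sketch of the general idea, not the authors' architecture. It assumes a hypothetical per-image encoder (`stateless_encode`, here an identity map over patch vectors), an unlearned scaled dot-product cross-attention over stored features, and a fixed residual `gate`; all of these names and choices are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Each query token attends over all memory tokens (no learned weights)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, m)) for m in memory]
        weights = softmax(scores)
        attended.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
        )
    return attended


def stateless_encode(patches):
    """Hypothetical stand-in for a per-image encoder: identity features."""
    return [list(p) for p in patches]


class StatefulEncoder:
    """Assumed design: add attention over past features as a gated residual."""

    def __init__(self, gate=0.5):
        self.gate = gate
        self.history = []  # token features of previously encoded images

    def encode(self, patches):
        feats = stateless_encode(patches)
        if self.history:
            context = cross_attend(feats, self.history)
            out = [
                [f + self.gate * c for f, c in zip(tok, ctx)]
                for tok, ctx in zip(feats, context)
            ]
        else:
            out = feats  # nothing to condition on for the first image
        self.history.extend(stateless_encode(patches))
        return out


# Two nearly identical "images", each with two 2-dimensional patch tokens.
image_a = [[1.0, 0.0], [0.0, 1.0]]
image_b = [[1.0, 0.1], [0.0, 1.0]]

encoder = StatefulEncoder()
first = encoder.encode(image_a)   # no history yet: same as stateless encoding
second = encoder.encode(image_b)  # conditioned on the features of image_a
print(first)
print(second)
```

In this toy setup the representation of the second image depends on what was seen before it, which is the property the abstract attributes to a stateful encoder; a stateless encoder would return the same features for `image_b` regardless of history.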

Problem

Research questions and friction points this paper is trying to address.

vision-language models

stateless visual encoder

visual context

visual changes

multi-image tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful Visual Encoder

Vision-Language Models

Visual Temporal Context