🤖 AI Summary
This work addresses the limitations of current vision-language models in spatially constrained object counting tasks, where real-world data often suffer from occlusion and structural ambiguity, leading to a scarcity of high-quality, controllable training samples. To bridge the gap between simplistic 2D datasets and complex real-world scenes, we introduce SITUATE—a novel object counting dataset built on controllable synthetic scenes that precisely model spatial layouts and occlusion relationships. Leveraging advanced synthetic image generation techniques, we construct counting instances with explicit spatial constraints and fine-tune models such as Qwen2.5-VL 7B on this data. Experimental results demonstrate that models trained on SITUATE significantly outperform those fine-tuned on real-world datasets of comparable scale, achieving superior generalization on out-of-distribution benchmarks like Pixmo Count.
📝 Abstract
We present SITUATE, a novel dataset designed for training and evaluating Vision-Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset improves generalization to out-of-distribution images: fine-tuning Qwen2.5-VL 7B on SITUATE improves accuracy on the Pixmo Count test data, but not vice versa. We cross-validate this by comparing model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo Count.