From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from overfitting and imbalanced performance during fine-tuning due to real-world data biases, annotation noise, and class-distribution skew. Method: the paper proposes a controllable synthetic-data generation framework that constructs unbiased, attribute-balanced synthetic scene datasets by decoupling sampling across spatial position, color, shape, and size, coupled with precise programmatic annotation, thereby eliminating distributional bias and human labeling errors while enabling cross-domain transfer. Contribution/Results: experiments on real-world benchmarks, including COCO, demonstrate that VLMs fine-tuned on the synthetic data significantly outperform those trained via conventional fine-tuning on absolute positional reasoning tasks. The approach yields higher overall accuracy and more uniform performance across spatial relations, validating the effectiveness of synthetic-data-driven generalization for spatial reasoning.

📝 Abstract
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Second, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
Problem

Research questions and friction points this paper is trying to address.

Addressing bias and imbalance in VLM fine-tuning with synthetic data
Controlling data generation and annotation quality to prevent overfitting
Enhancing spatial reasoning transfer from synthetic to real-world performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled synthetic data generation to eliminate bias
Comprehensive attribute sampling for balanced dataset construction
Fine-tuning VLMs on synthetic data for real-world transfer
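The balanced-dataset construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attribute vocabularies, question template, and function names are all assumptions. The key idea it demonstrates is that enumerating the full Cartesian product of attribute values yields a dataset that is balanced by construction, and deriving the answer label directly from the scene specification makes the annotation programmatic and error-free.

```python
import itertools
import random

# Hypothetical attribute vocabularies; the paper's exact values are not given here.
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["cube", "sphere", "cylinder"]
SIZES = ["small", "large"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right", "center"]


def build_balanced_dataset(seed=0):
    """Enumerate every (color, shape, size, position) combination exactly once,
    so no attribute value is over- or under-represented."""
    rng = random.Random(seed)
    samples = []
    for color, shape, size, pos in itertools.product(COLORS, SHAPES, SIZES, POSITIONS):
        samples.append({
            "scene": {"color": color, "shape": shape, "size": size, "position": pos},
            # Programmatic annotation: the label is derived from the scene spec,
            # so it cannot contain human labeling errors.
            "question": f"Where is the {size} {color} {shape} in the image?",
            "answer": pos,
        })
    rng.shuffle(samples)  # shuffle order without changing the attribute balance
    return samples


dataset = build_balanced_dataset()
# 4 colors x 3 shapes x 2 sizes x 5 positions = 120 samples;
# each position label occurs exactly 4 * 3 * 2 = 24 times.
```

In practice each specification would be rendered into a synthetic image, and the resulting (image, question, answer) triples used for fine-tuning; the decoupled sampling guarantees uniform coverage of the spatial positions the model is later evaluated on.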
Massimo Rizzoli
Signals and Interactive Systems Lab, University of Trento, Italy
Simone Alghisi
Signals and Interactive Systems Lab, University of Trento, Italy
Seyed Mahed Mousavi
Signals and Interactive Systems Lab, University of Trento, Italy
Giuseppe Riccardi
Professor of Computer Science, University of Trento, Italy
Natural Language Processing · Speech Processing · Dialogue · Machine Learning