🤖 AI Summary
Current vision-language models (VLMs) struggle to capture the compositional syntactic structure of language, often treating image captions as unordered bags of words. This limits their generalization on compositional reasoning tasks requiring explicit parsing of grammatical roles (e.g., subject, predicate) and their relational dependencies.
Method: We propose a dependency-syntax-driven Causal Graphical Model (CGM) that integrates causal graph structures into VLM decoding. CGM establishes a structured, partially ordered generation mechanism by explicitly modeling the core semantic causal chain via dependency syntax, thereby disentangling true causal dependencies from spurious statistical correlations.
Contribution/Results: CGM trains a structured decoder conditioned on the VLM's visual encoder. It achieves state-of-the-art performance across five compositional benchmarks, outperforming prior methods even when trained on significantly less data than large-scale baselines. These results empirically support causal guidance as an effective means of improving compositional generalization in VLMs.
📝 Abstract
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which can only be solved through a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built with a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel prediction, our decoder's generative process is partially ordered, following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence, discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all state-of-the-art compositional approaches by a large margin, and that it also improves over methods trained on much larger datasets.
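To make the partially ordered generation concrete, the following is a minimal sketch (not the paper's implementation) of how a dependency parse can induce a decoding schedule: each token is generated only after its syntactic head, and tokens at the same depth form a step. The token list and head indices below are a hypothetical parser output for "a dog chases a ball", hard-coded for self-containment.

```python
from collections import defaultdict

# Hypothetical dependency parse: heads[i] is the head of tokens[i], -1 = root.
tokens = ["a", "dog", "chases", "a", "ball"]
heads = [1, 2, -1, 4, 2]

# Build child lists: a token may only be generated after its head.
children = defaultdict(list)
root = None
for i, h in enumerate(heads):
    if h == -1:
        root = i
    else:
        children[h].append(i)

# Breadth-first traversal from the root yields generation "steps"; tokens in
# the same step depend only on already-generated heads, so they can be
# decoded in parallel within the step.
steps, frontier = [], [root]
while frontier:
    steps.append([tokens[i] for i in frontier])
    frontier = [c for i in frontier for c in children[i]]

print(steps)  # → [['chases'], ['dog', 'ball'], ['a', 'a']]
```

The root verb is decoded first, then its arguments, then their modifiers, which is the partial order the CGM structure imposes in place of strict left-to-right generation.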