Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

📅 2026-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit limited compositional understanding: they often fail to model object relations, attribute-object bindings, and word-order dependencies, and thus display “bag-of-words” behavior. To address this, the work proposes the MACCO framework, which strengthens cross-modal compositional alignment by masking compositional concepts in one modality and reconstructing them conditioned on the full context of the other modality. Two auxiliary objectives jointly align and regularize the masked features both across and within modalities, so that the model exploits the fine-grained compositional information in image–text pairs rather than relying only on global single-vector representations. Experiments on five compositional benchmarks show significant gains in compositionality, along with a better ability to capture syntactic structure and linguistic information; the improved compositionality also benefits text-to-image generation and multimodal large language models.
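The “bag-of-words” behavior mentioned above can be illustrated with a toy sketch. It is not taken from the paper: the two-dimensional word vectors and the `bag_of_words_embedding` helper are hypothetical, and real models such as CLIP use learned encoders rather than plain averaging. The point is only that an order-insensitive pooling step assigns the same embedding to captions that describe different relations.

```python
# Toy illustration only: hypothetical 2-d word vectors, not CLIP embeddings.
WORD_VECTORS = {
    "dog": (1.0, 0.0),
    "chases": (0.5, 0.5),
    "cat": (0.0, 1.0),
}


def bag_of_words_embedding(caption):
    """Average the word vectors of a caption, discarding word order entirely."""
    words = caption.lower().split()
    dims = len(next(iter(WORD_VECTORS.values())))
    totals = [0.0] * dims
    for word in words:
        for i, value in enumerate(WORD_VECTORS[word]):
            totals[i] += value
    return tuple(total / len(words) for total in totals)


a = bag_of_words_embedding("dog chases cat")
b = bag_of_words_embedding("cat chases dog")

# The two captions describe opposite relations but get identical embeddings.
print(a == b)
```

Because the two captions collapse to one vector, no similarity score computed from that vector can tell which animal is doing the chasing, which is the kind of failure compositional benchmarks probe.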
📝 Abstract
Contrastively trained vision-language models such as CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.
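The mask-then-reconstruct idea described in the abstract can be sketched in a few lines. This is a minimal toy under stated assumptions, not the MACCO implementation: the shared concept vectors, the hand-written image region features, and the helpers `mask_concepts` and `reconstruct` are all hypothetical, and a nearest-neighbour lookup stands in for the learned encoders and reconstruction objective a real model would use.

```python
import math

# Hypothetical toy embeddings shared by both modalities (an assumption for
# illustration; the paper's model learns its own features).
CONCEPT_VECTORS = {
    "red": (1.0, 0.0, 0.0),
    "blue": (0.0, 1.0, 0.0),
    "cube": (0.0, 0.0, 1.0),
}
MASK = "[MASK]"


def mask_concepts(tokens, concept_positions):
    """Replace the tokens at the given positions with a mask placeholder."""
    return [MASK if i in concept_positions else tok for i, tok in enumerate(tokens)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reconstruct(masked_tokens, image_regions):
    """Fill every mask with the concept best supported by the unmasked image.

    Concepts already visible in the text are excluded, so the prediction uses
    both the remaining text context and the full image context.
    """
    visible = {tok for tok in masked_tokens if tok != MASK}
    candidates = [c for c in CONCEPT_VECTORS if c not in visible]

    def support(concept):
        return max(cosine(CONCEPT_VECTORS[concept], region) for region in image_regions)

    best = max(candidates, key=support)
    return [best if tok == MASK else tok for tok in masked_tokens]


tokens = ["a", "red", "cube"]
# Hand-written region features: one reddish region and one cube-shaped region.
image_regions = [(0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]

masked = mask_concepts(tokens, {1})  # hide the attribute "red"
reconstructed = reconstruct(masked, image_regions)
print(masked, "->", reconstructed)
```

In this sketch the masked attribute can only be recovered by consulting the image, which is the intuition behind conditioning the reconstruction on the other modality; the symmetric direction (masking image content and reconstructing it from the text) is omitted for brevity.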
Problem

Research questions and friction points this paper is trying to address.

compositional understanding
vision-language models
cross-modal compositionality
object relations
attribute-object binding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal compositionality
masked concept modeling
visio-linguistic alignment
compositional reasoning
vision-language models