Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models exhibit limited compositional understanding, often failing to properly model object relations, attribute-object bindings, and word-order dependencies, thereby displaying “bag-of-words” behavior. To address this, this work proposes the MACCO framework, which enhances cross-modal compositional alignment by masking compositional concepts in one modality and reconstructing them conditioned on the full context of the other modality. MACCO integrates cross-modal and intra-modal joint alignment, contrastive learning, and auxiliary regularization objectives to effectively uncover fine-grained compositional information within image–text pairs, overcoming the limitations of global single-vector representations. Experiments demonstrate that MACCO achieves significant performance gains across five compositional benchmarks, substantially improving the model’s ability to capture syntactic structures and linguistic nuances, while also enhancing text-to-image generation and multimodal large language model capabilities.
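The "bag-of-words" failure mode can be shown with a deliberately extreme toy case: a caption encoder that averages word vectors into one global vector. This is a minimal sketch, not how CLIP's text encoder works. The helper names (`toy_word_vectors`, `mean_pool_encode`) and the random word vectors are assumptions made for the example. It only shows why a single pooled vector can lose word order.

```python
import numpy as np

def toy_word_vectors(vocab, dim=8, seed=0):
    """Hypothetical random word vectors; stand-ins for learned embeddings."""
    rng = np.random.default_rng(seed)
    return {word: rng.standard_normal(dim) for word in vocab}

def mean_pool_encode(caption, vectors):
    """Order-invariant caption embedding: mean of word vectors, L2-normalised."""
    pooled = np.mean([vectors[w] for w in caption.split()], axis=0)
    return pooled / np.linalg.norm(pooled)

vectors = toy_word_vectors(["dog", "bites", "man"])
a = mean_pool_encode("dog bites man", vectors)
b = mean_pool_encode("man bites dog", vectors)

# Cosine similarity between the two captions. Because pooling ignores order,
# the two embeddings coincide (up to floating-point rounding).
similarity = float(a @ b)
print(f"cosine similarity: {similarity:.6f}")
```

The two captions describe opposite relations, yet the pooled encoder cannot tell them apart. Real contrastively trained encoders are not strictly order-invariant, but the summary above attributes similar behavior to optimizing global single-vector representations.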

📝 Abstract

Contrastively trained vision-language models such as CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.
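The core idea in the abstract, masking a concept in one modality and reconstructing it from the full context of the other, can be sketched in a few lines of NumPy. This is an illustrative sketch only. The abstract does not specify the architecture, masking strategy, or loss, so the single-head cross-attention, the zero mask vector, the cosine reconstruction loss, and all names below (`mask_concept`, `cross_attend`, `reconstruction_loss`, `concept_idx`) are assumptions, and random features stand in for real encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_concept(text_tokens, concept_idx, mask_vec):
    """Replace the token features at the concept positions with a mask vector."""
    masked = text_tokens.copy()
    masked[concept_idx] = mask_vec
    return masked

def cross_attend(queries, image_patches):
    """Single-head cross-attention: each text token attends to all image patches."""
    dim = queries.shape[-1]
    weights = softmax(queries @ image_patches.T / np.sqrt(dim))
    return weights @ image_patches

def reconstruction_loss(pred, target):
    """Mean (1 - cosine similarity) over the masked positions."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((6, 16))    # stand-in for 6 text token features
image_patches = rng.standard_normal((9, 16))  # stand-in for 9 unmasked image patches
concept_idx = np.array([1, 2])                # hypothetical span, e.g. "red cube"
mask_vec = np.zeros(16)                       # assumed mask embedding

# Mask the concept on the text side, then rebuild it from the full image context.
masked = mask_concept(text_tokens, concept_idx, mask_vec)
recon = masked + cross_attend(masked, image_patches)
loss = reconstruction_loss(recon[concept_idx], text_tokens[concept_idx])
print(f"reconstruction loss on masked concept: {loss:.4f}")
```

In a trained model the encoders and attention would have learned parameters, and the same procedure could be mirrored by masking image regions and reconstructing them from the full caption. The released code at the URL above is the reference for the actual method.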

Problem

Research questions and friction points this paper is trying to address.

compositional understanding

vision-language models

cross-modal compositionality

object relations

attribute-object binding

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal compositionality

masked concept modeling

visio-linguistic alignment