🤖 AI Summary
Current large vision-language models (LVLMs) process images at the token level, resulting in low computational efficiency and a lack of human-like, concept-level understanding. To address this, we propose the first end-to-end self-supervised visual concept modeling framework that jointly integrates implicit contrastive learning with vision-language instruction tuning, requiring no concept-level annotations, to learn interpretable and transferable visual concept representations. Our method introduces a multi-instance sampling contrastive mechanism and a concept-aware optimization of the visual encoder. Evaluated on LLaVA-1.5-7B, it reduces FLOPs by 85% while preserving multi-task image understanding performance and significantly improving visual concept recognition accuracy. This work breaks the prevailing paradigm in LVLMs of relying solely on pixel- or patch-level modeling, establishing a novel pathway toward efficient, interpretable, and semantically grounded vision-language understanding.
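The summary does not spell out the form of the contrastive objective, so the following is only a generic sketch of how "implicit contrastive learning across multiple sampled instances" is commonly implemented: an InfoNCE-style loss in which two sampled views (e.g., different token subsets) of the same image are treated as a positive pair and other images in the batch as negatives. The temperature value and the NumPy formulation are illustrative assumptions, not details from the paper.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative, not the paper's exact loss).

    anchors, positives: (batch, dim) embeddings of two sampled views;
    row i of `positives` is the positive pair for row i of `anchors`,
    all other rows serve as in-batch negatives.
    """
    # L2-normalize so similarities are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # maximize the log-probability of the matched (diagonal) pair
    return -np.mean(np.diag(log_probs))
```

When the two views encode the same underlying concept, the diagonal similarities dominate and the loss approaches zero; mismatched views yield a loss near log(batch size).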
📝 Abstract
Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans, who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.
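To see why compressing visual tokens to concept-level representations can cut FLOPs so sharply, a back-of-envelope estimate helps. The sketch below is illustrative arithmetic, not the paper's measurement methodology: it assumes LLaVA-1.5-7B's 576 visual tokens, a hidden size of 4096, a hypothetical text-token count, and a standard rough transformer prefill cost model (linear projection/FFN term plus quadratic attention term, constants omitted).

```python
def prefill_flops_ratio(n_visual, n_text, keep_frac, d=4096):
    """Rough ratio of prefill FLOPs after keeping only keep_frac of visual tokens.

    Cost model (per layer, constants dropped):
      ~12 * N * d^2  for projections/FFN (linear in sequence length N)
      ~ 2 * N^2 * d  for attention scores/values (quadratic in N)
    This is an illustrative estimate, not the paper's benchmark.
    """
    def cost(n_vis_tokens):
        n = n_vis_tokens + n_text
        return 12 * n * d * d + 2 * n * n * d
    return cost(int(n_visual * keep_frac)) / cost(n_visual)
```

Because the linear term dominates at these sequence lengths, the ratio tracks the fraction of tokens kept: retaining roughly 10% of the 576 visual tokens (alongside, say, 64 text tokens) already brings prefill cost below 20% of the original, consistent in spirit with the reported 85% FLOPs reduction.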