SmoGVLM: A Small, Graph-enhanced Vision-Language Model

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two weaknesses of large vision-language models in knowledge-intensive reasoning: susceptibility to hallucination and limited semantic grounding. The authors propose integrating graph neural networks into a small-scale vision-language model to incorporate structured external knowledge and strengthen multimodal reasoning. With this approach, a compact model with only 1.3 billion parameters achieves performance gains of up to 16.24%, outperforming both larger 13-billion-parameter models and strong fine-tuned baselines. The results point to structured knowledge infusion as a route to efficient yet accurate multimodal reasoning.

📝 Abstract

Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that uses Graph Neural Networks to integrate structured knowledge with the visual and textual modalities. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and surpass both larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
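
The abstract does not spell out how the graph component is wired into the model, so the following is only a generic illustration of the idea of graph-enhanced input, not the SmoGVLM architecture. It runs one round of mean-aggregation message passing over a toy knowledge graph, pools the nodes into a single vector, and prepends that vector to a sequence of token embeddings. The function names (`mean_aggregate`, `pool_graph`, `fuse`), the toy graph, and the choice of mean aggregation are all assumptions made for this sketch.

```python
# Illustrative sketch only: NOT the SmoGVLM implementation.
# All names, the toy graph, and the mean-aggregation rule are assumptions.
from typing import Dict, List, Tuple

Vector = List[float]


def mean_aggregate(features: Dict[str, Vector],
                   edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing on an undirected graph.

    Each node's new vector is the mean of its own vector and its neighbors'.
    """
    neighbors = {node: [node] for node in features}  # self-loop keeps own signal
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    updated = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)
        ]
    return updated


def pool_graph(features: Dict[str, Vector]) -> Vector:
    """Mean-pool all node vectors into a single graph-level embedding."""
    vectors = list(features.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fuse(graph_vec: Vector, token_embeddings: List[Vector]) -> List[Vector]:
    """Prepend the graph embedding as one extra 'token'.

    Assumes the graph vector already has the token dimension; a real system
    would typically need a learned projection to match dimensions.
    """
    return [graph_vec] + token_embeddings


# Toy knowledge graph with 2-dimensional node features (hypothetical data).
node_features = {"dog": [1.0, 0.0], "animal": [0.0, 1.0], "leash": [1.0, 1.0]}
graph_edges = [("dog", "animal"), ("dog", "leash")]

updated_nodes = mean_aggregate(node_features, graph_edges)
graph_embedding = pool_graph(updated_nodes)

# Stand-ins for visual and textual token embeddings (hypothetical data).
tokens = [[0.2, 0.4], [0.9, 0.1]]
model_input = fuse(graph_embedding, tokens)
```

In practice the aggregation weights would be learned and the graph would be retrieved per example; the sketch only shows the data flow from structured knowledge to a sequence a language model could consume.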

Problem

Research questions and friction points this paper is trying to address.

vision-language models

hallucination

knowledge grounding

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-enhanced

vision-language model

structured knowledge

Graph Neural Networks