Do Vision-Language Models Really Understand Visual Language?

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates large vision-language models’ (LVLMs) deep understanding of conceptual entities and relational structures in charts. Addressing the lack of symbolic reasoning and compositional inference evaluation in existing benchmarks, we introduce a cross-domain chart understanding benchmark encompassing both synthetic and real-world scenarios. Our methodology comprises three components: (1) controllable synthetic chart generation, (2) multi-granularity relational question answering, and (3) knowledge-isolation ablation experiments. We systematically evaluate leading LVLMs—including LLaVA and Qwen-VL—under these conditions. Our findings reveal a fundamental limitation: while entity recognition accuracy remains high, relational reasoning error rates exceed 62%; critically, performance collapses when prior knowledge is isolated, indicating heavy reliance on external background knowledge rather than robust vision–language joint semantic grounding. This challenges prevailing assumptions about LVLMs’ “true understanding” of charts and establishes a new evaluation paradigm for trustworthy chart AI, with concrete implications for model diagnosis and architectural improvement.

📝 Abstract
Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Chart Symbolism Understanding
Complex Ideas and Relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language Models
Chart Understanding
Background Knowledge Utilization
Buse Giledereli
Department of Computer Science, ETH Zürich
Yifan Hou
Department of Computer Science, ETH Zürich
Yilei Tu
Department of Computer Science, ETH Zürich
Mrinmaya Sachan
Assistant Professor, ETH Zürich
Natural Language Processing · Reasoning · AI for Education