Can Visual Encoder Learn to See Arrows?

📅 2025-05-26
📈 Citations: 0
Influential Citations: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) struggle to comprehend structured visual relationships, such as arrows and connecting lines, because their image encoders represent edges poorly, a weakness stemming largely from pervasive textual and positional biases in training data. Method: We introduce the first synthetic diagram-caption dataset that is free of both textual and positional bias, and use it for contrastive learning with targeted image-encoder fine-tuning to explicitly encourage the learning of edge-structured features. Contribution/Results: Through probing analysis, cross-modal retrieval, and diagram-captioning evaluation, we demonstrate that the fine-tuned encoder significantly outperforms pretrained CLIP on edge-detection probing and surpasses zero-shot GPT-4o and LLaVA-Mistral on diagram description. This work constitutes the first systematic identification and mitigation of the edge-perception bottleneck in VLMs for structured visual relation modeling.
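To make the training recipe in the summary concrete, below is a minimal sketch of CLIP-style contrastive fine-tuning on diagram-caption pairs. It assumes the open_clip package and a standard symmetric InfoNCE loss; freezing everything except the visual tower, the model choice, and the batch variables are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of CLIP-style contrastive fine-tuning on diagram-caption
# pairs, assuming the open_clip package. Training only the image encoder
# is an assumption for illustration, not a detail confirmed by the paper.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device)

# Freeze all parameters, then unfreeze the visual tower so edge features
# must be carried by the image side of the embedding space.
for p in model.parameters():
    p.requires_grad = False
for p in model.visual.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def contrastive_step(images, captions):
    """One symmetric InfoNCE step over a batch of (diagram, caption) pairs."""
    tokens = tokenizer(captions).to(device)
    img = F.normalize(model.encode_image(images.to(device)), dim=-1)
    txt = F.normalize(model.encode_text(tokens), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.t()
    labels = torch.arange(len(images), device=device)
    # Image-to-text and text-to-image directions are averaged.
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each caption describes only connectivity, matching images to captions pressures the image features to encode which nodes the arrows join rather than where the nodes happen to sit.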

📝 Abstract
A diagram is a visual representation of relationships illustrated with edges (lines or arrows) and is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision-language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, which prevents VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representations through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
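The probing task mentioned in the abstract can be pictured as a linear classifier trained on frozen encoder features. The sketch below is a generic linear-probe setup, not the paper's exact protocol; the data loader, the encoder handle, and the edge-presence labels are hypothetical placeholders.

```python
# Generic linear-probe sketch: a linear classifier is trained on frozen
# image-encoder features to predict a diagram property (e.g. whether an
# edge is present). Loader and labels are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder once and cache features with their labels."""
    encoder.eval()
    feats, labels = [], []
    for images, y in loader:
        feats.append(encoder(images.to(device)).float().cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(features, labels, epochs=100, lr=1e-2):
    """Fit one linear layer on cached features; labels must be int64."""
    probe = nn.Linear(features.shape[1], int(labels.max()) + 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(probe(features), labels).backward()
        opt.step()
    acc = (probe(features).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc
```

The logic of the evaluation is that the probe itself is too weak to compute edge features from scratch, so high probe accuracy indicates the encoder already represents edges explicitly.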
Problem

Research questions and friction points this paper is trying to address.

Can VLMs learn to recognize edges in diagrams?
Overcoming textual and positional biases in edge recognition
Improving diagram understanding via contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning on an artificially generated diagram-caption dataset
Eliminating textual and positional biases from the training data (see the generation sketch after this list)
Finetuned encoder outperforms pretrained CLIP and zero-shot GPT-4o
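As a concrete, hypothetical picture of what a text-free, position-bias-free sample might look like, the sketch below places nodes at uniformly random positions and renders only arrows, pairing the image with a connectivity caption. It uses matplotlib and is our illustration, not the paper's actual generator.

```python
# Hypothetical generator for one text-free, position-unbiased diagram:
# node positions are drawn uniformly at random, so layout carries no
# information about edge direction, and no text appears in the image.
import random
import matplotlib.pyplot as plt

def random_diagram(n_nodes=5, n_edges=4, seed=None):
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n_nodes)]
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add((a, b))
    fig, ax = plt.subplots(figsize=(3, 3))
    for a, b in edges:
        ax.annotate("", xy=pos[b], xytext=pos[a],
                    arrowprops=dict(arrowstyle="->"))
    for x, y in pos:
        ax.plot(x, y, "o", markersize=12, color="white",
                markeredgecolor="black")
    ax.set_axis_off()
    # The caption, not the pixels, carries the symbolic relation.
    caption = "; ".join(f"node {a} points to node {b}"
                        for a, b in sorted(edges))
    return fig, caption
```

Randomizing positions per sample means the only reliable signal linking an image to its caption is the arrows themselves, which is exactly the bias removal the paper argues for.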
🔎 Similar Papers
No similar papers found.
Naoyuki Terashita
Hitachi, Ltd.
Yusuke Tozaki
Hitachi, Ltd., Kyoto Sangyo University
Hideaki Omote
Hitachi, Ltd., Gifu University
Congkha Nguyen
Hitachi, Ltd.
Ryosuke Nakamoto
Hitachi, Ltd.
Yuta Koreeda
Hitachi, Ltd., Hitachi America, Ltd., Stanford CS
natural language processing, machine learning, robot, computer-assisted surgery
Hiroaki Ozaki
Hitachi, Ltd.