Can Visual Encoder Learn to See Arrows?

📅 2025-05-26
📈 Citations: 0
Influential Citations: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) struggle to comprehend structured visual relationships, such as arrows and connecting lines, because their image encoders represent edges poorly, a weakness stemming largely from pervasive textual and positional biases in training data. Method: We introduce the first synthetic diagram-caption dataset that is free of both textual and positional bias, and use it for contrastive learning with targeted image-encoder fine-tuning to explicitly encourage the learning of edge-structured features. Contribution/Results: Through probing analysis, cross-modal retrieval, and diagram-captioning evaluation, we demonstrate that the fine-tuned encoder significantly outperforms pretrained CLIP on edge-detection probing and surpasses zero-shot GPT-4o and LLaVA-Mistral on diagram description. This work constitutes the first systematic identification and mitigation of the edge-perception bottleneck in VLMs for structured visual relation modeling.
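To make the training recipe in the summary concrete, below is a minimal sketch of CLIP-style contrastive fine-tuning on diagram-caption pairs. It assumes the open_clip package and a standard symmetric InfoNCE loss; freezing everything except the visual tower, the model choice, and the batch variables are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of CLIP-style contrastive fine-tuning on diagram-caption
# pairs, assuming the open_clip package. Training only the image encoder
# is an assumption for illustration, not a detail confirmed by the paper.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device)

# Freeze all parameters, then unfreeze the visual tower so edge features
# must be carried by the image side of the embedding space.
for p in model.parameters():
    p.requires_grad = False
for p in model.visual.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def contrastive_step(images, captions):
    """One symmetric InfoNCE step over a batch of (diagram, caption) pairs."""
    tokens = tokenizer(captions).to(device)
    img = F.normalize(model.encode_image(images.to(device)), dim=-1)
    txt = F.normalize(model.encode_text(tokens), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.t()
    labels = torch.arange(len(images), device=device)
    # Image-to-text and text-to-image directions are averaged.
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each caption describes only connectivity, matching images to captions pressures the image features to encode which nodes the arrows join rather than where the nodes happen to sit.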

📝 Abstract
A diagram is a visual representation of relationships illustrated with edges (lines or arrows) and is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision-language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, which prevents VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representations through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
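The probing task mentioned in the abstract can be pictured as a linear classifier trained on frozen encoder features. The sketch below is a generic linear-probe setup, not the paper's exact protocol; the data loader, the encoder handle, and the edge-presence labels are hypothetical placeholders.

```python
# Generic linear-probe sketch: a linear classifier is trained on frozen
# image-encoder features to predict a diagram property (e.g. whether an
# edge is present). Loader and labels are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder once and cache features with their labels."""
    encoder.eval()
    feats, labels = [], []
    for images, y in loader:
        feats.append(encoder(images.to(device)).float().cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(features, labels, epochs=100, lr=1e-2):
    """Fit one linear layer on cached features; labels must be int64."""
    probe = nn.Linear(features.shape[1], int(labels.max()) + 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(probe(features), labels).backward()
        opt.step()
    acc = (probe(features).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc
```

The logic of the evaluation is that the probe itself is too weak to compute edge features from scratch, so high probe accuracy indicates the encoder already represents edges explicitly.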
Problem

Research questions and friction points this paper is trying to address.

Can VLMs learn to recognize edges in diagrams?
Overcoming textual and positional biases in edge recognition
Improving diagram understanding via contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning on an artificially generated diagram-caption dataset
Eliminating textual and positional biases from the training data (see the generation sketch after this list)
Finetuned encoder outperforms pretrained CLIP and zero-shot GPT-4o
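As a concrete, hypothetical picture of what a text-free, position-bias-free sample might look like, the sketch below places nodes at uniformly random positions and renders only arrows, pairing the image with a connectivity caption. It uses matplotlib and is our illustration, not the paper's actual generator.

```python
# Hypothetical generator for one text-free, position-unbiased diagram:
# node positions are drawn uniformly at random, so layout carries no
# information about edge direction, and no text appears in the image.
import random
import matplotlib.pyplot as plt

def random_diagram(n_nodes=5, n_edges=4, seed=None):
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n_nodes)]
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add((a, b))
    fig, ax = plt.subplots(figsize=(3, 3))
    for a, b in edges:
        ax.annotate("", xy=pos[b], xytext=pos[a],
                    arrowprops=dict(arrowstyle="->"))
    for x, y in pos:
        ax.plot(x, y, "o", markersize=12, color="white",
                markeredgecolor="black")
    ax.set_axis_off()
    # The caption, not the pixels, carries the symbolic relation.
    caption = "; ".join(f"node {a} points to node {b}"
                        for a, b in sorted(edges))
    return fig, caption
```

Randomizing positions per sample means the only reliable signal linking an image to its caption is the arrows themselves, which is exactly the bias removal the paper argues for.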
🔎 Similar Papers
No similar papers found.
Naoyuki Terashita
Hitachi, Ltd.
Yusuke Tozaki
Hitachi, Ltd., Kyoto Sangyo University
Hideaki Omote
Hitachi, Ltd., Gifu University
Congkha Nguyen
Hitachi, Ltd.
Ryosuke Nakamoto
Hitachi, Ltd.
Yuta Koreeda
Hitachi, Ltd., Hitachi America, Ltd., Stanford CS
natural language processing, machine learning, robot, computer-assisted surgery
Hiroaki Ozaki
Hitachi, Ltd.