GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

📅 2024-04-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision Transformers (ViTs) trained from scratch on small-scale datasets underperform convolutional neural networks (CNNs), primarily due to their lack of inductive bias for local spatial structure. Method: We propose GvT, a graph-based Vision Transformer that (i) models local spatial adjacency via a novel graph convolutional projection and graph-pooling; (ii) introduces a sparsely selected bilinear talking-heads mechanism to overcome the low-rank limitation of standard multi-head self-attention; and (iii) deeply integrates self-attention with graph convolutional layers. Contribution/Results: When trained from scratch on small-scale benchmarks (including ImageNet-1K), GvT significantly outperforms baseline ViTs and matches or exceeds the performance of deep CNNs (e.g., ResNet-50), demonstrating the efficacy of graph-structured priors for sample-efficient ViT training. The code is publicly available.

📝 Abstract
Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. The code for our proposed model is publicly available on the website.
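The graph convolutional projection described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grid_adjacency`, `w_q`, and `w_k` are hypothetical names, and the paper's actual normalization and layer details may differ. The idea is that queries and keys are produced by first aggregating each token's spatial neighbours through a normalized adjacency matrix, then applying a linear projection:

```python
import numpy as np

def grid_adjacency(h, w):
    """Row-normalized 4-neighbour adjacency (with self-loops) for an h x w token grid."""
    n = h * w
    a = np.eye(n)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if r + 1 < h:                       # edge to the token below
                a[i, i + w] = a[i + w, i] = 1.0
            if c + 1 < w:                       # edge to the token on the right
                a[i, i + 1] = a[i + 1, i] = 1.0
    return a / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
h = w = 4            # a 4x4 patch grid -> 16 tokens (illustrative sizes)
d, d_head = 32, 8    # embedding dim and per-head dim (illustrative)
x = rng.standard_normal((h * w, d))
w_q = rng.standard_normal((d, d_head))
w_k = rng.standard_normal((d, d_head))

a_norm = grid_adjacency(h, w)
# Graph convolutional projection: aggregate each token's spatial
# neighbours before the linear maps that produce queries and keys.
q = a_norm @ x @ w_q
k = a_norm @ x @ w_k
attn_logits = q @ k.T / np.sqrt(d_head)
print(attn_logits.shape)  # (16, 16)
```

Because `a_norm` encodes the patch grid's spatial layout, the projection injects a local inductive bias that plain linear Q/K projections lack.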
Problem

Research questions and friction points this paper is trying to address.

Bridges the performance gap between ViTs and CNNs on small datasets
Overcomes the low-rank bottleneck in attention heads via talking-heads
Enhances token aggregation with graph-pooling for richer semantics
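The low-rank bottleneck arises because the per-head query/key dimension shrinks as the head count grows, so each head's logit matrix has rank at most `d_head`. The sketch below illustrates the general talking-heads idea (mixing attention scores across heads before and after the softmax) combined with a simple top-k sparse selection; it is an assumption-laden stand-in, not the paper's bilinear-pooled variant, and `w_pre`, `w_post`, and the top-k rule are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_talking_heads(logits, w_pre, w_post, k):
    """logits: (heads, n, n) attention logits.
    Mix logits across heads (talking-heads), keep only the top-k
    scores per query (sparse selection), softmax over keys, then
    mix the resulting weights across heads again."""
    # pre-softmax head mixing: each output head sees all input heads
    mixed = np.einsum('hij,hg->gij', logits, w_pre)
    # threshold = k-th largest score in each row; mask the rest to -inf
    thresh = np.sort(mixed, axis=-1)[..., -k:][..., :1]
    masked = np.where(mixed >= thresh, mixed, -np.inf)
    weights = softmax(masked, axis=-1)
    # post-softmax head mixing
    return np.einsum('hij,hg->gij', weights, w_post)

rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 4, 4))   # 2 heads, 4 tokens
eye = np.eye(2)                           # identity mixing for the demo
out = sparse_talking_heads(logits, eye, eye, k=2)
```

With identity mixing matrices each row of `out` is a valid distribution with exactly `k` nonzero entries; learned `w_pre`/`w_post` let every head's weights depend on all queries and keys, which is what lifts the rank restriction of a single head.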
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based Vision Transformer with graph convolution
Talking-heads mechanism based on bilinear pooling and sparse selection
Graph-pooling for token reduction and aggregation
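Graph-pooling for token reduction can be sketched with a DiffPool-style soft cluster assignment: a learned matrix maps `n` tokens to `m < n` pooled tokens and coarsens the adjacency accordingly. This is an illustrative assumption; the paper's exact pooling operator may differ, and `w_assign` is a hypothetical parameter name:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_pool(x, a, w_assign):
    """Soft-assignment pooling: map n tokens to m < n clusters.
    x: (n, d) tokens, a: (n, n) adjacency, w_assign: (d, m)."""
    s = softmax(x @ w_assign, axis=-1)   # (n, m) soft cluster assignments
    x_pooled = s.T @ x                   # (m, d) pooled tokens
    a_pooled = s.T @ a @ s               # (m, m) coarsened adjacency
    return x_pooled, a_pooled

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 32))        # 16 tokens, dim 32 (illustrative)
a = np.eye(16)                           # placeholder adjacency
xp, ap = graph_pool(x, a, rng.standard_normal((32, 4)))
print(xp.shape, ap.shape)
```

Pooling between intermediate blocks shrinks the token count (here 16 to 4), so later attention layers operate on fewer, semantically aggregated tokens.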
Dongjing Shan
Southwest Medical University
Guiqiang Chen