SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer

📅 2024-10-28
🏛️ ACM Multimedia
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of low detection accuracy and efficiency caused by extreme object sparsity and severe scale variation in high-resolution wide-angle (HRW) images, this paper proposes a model-agnostic sparse Vision Transformer architecture. The method introduces three key innovations: (1) a selective token activation mechanism that models only candidate object-containing windows, drastically reducing computational redundancy; (2) cross-slice non-maximum suppression (C-NMS) to mitigate boundary artifacts induced by image tiling; and (3) coarse-to-fine attention collaboration with multi-scale feature fusion to enhance small-object perception. Evaluated on the PANDA and DOTA-v1.0 benchmarks, the approach achieves up to a 5.8% improvement in mean Average Precision (mAP) while running up to three times faster than current state-of-the-art methods. To the best of the authors' knowledge, this is the first work to achieve both high accuracy and high efficiency for sparse object detection in HRW scenarios.

📝 Abstract
Recent years have seen increasing use of gigapixel-level image and video capture systems and benchmarks featuring high-resolution wide (HRW) shots. However, unlike the close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, making existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm that precisely localizes objects from noisy windows, and from a simple yet effective multi-scale strategy that further improves accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that SparseFormer significantly improves detection accuracy (by up to 5.8%) and speed (by up to 3x) over state-of-the-art approaches.
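At a high level, cross-slice NMS pools per-slice detections into global image coordinates and then suppresses duplicates that straddle slice boundaries. The following is a minimal illustrative sketch, not the paper's actual C-NMS: it assumes standard greedy NMS, and the `to_global` helper and IoU threshold are hypothetical choices.

```python
# Illustrative sketch of cross-slice duplicate suppression (hypothetical;
# the paper's exact C-NMS criteria are not reproduced here).
import numpy as np

def to_global(boxes, slice_origin):
    """Shift per-slice boxes (x1, y1, x2, y2) by the slice's top-left offset."""
    ox, oy = slice_origin
    return boxes + np.array([ox, oy, ox, oy], dtype=boxes.dtype)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over boxes pooled from all slices; returns kept indices."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap box i too much (likely the same object
        # detected in two adjacent slices).
        order = rest[iou <= iou_thresh]
    return keep
```

For example, an object sitting on the seam between two overlapping slices is detected once per slice; after mapping both boxes into global coordinates, NMS keeps only the higher-scoring copy.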
Problem

Research questions and friction points this paper is trying to address.

Detect objects in high-resolution wide shots
Address extreme sparsity and scale changes
Improve accuracy and speed in object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Vision Transformer
Cross-slice NMS algorithm
Multi-scale feature fusion
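The sparse-scrutiny idea above can be sketched in a simplified form: score each window of a feature map with a cheap proxy and keep only the top fraction for full attention. This is an illustrative approximation under stated assumptions; the paper's learned token-scoring mechanism is not shown, and `window`, `keep_ratio`, and the mean-activation proxy are hypothetical.

```python
# Illustrative sketch of selective window activation (hypothetical proxy:
# mean feature activation stands in for the paper's learned token scores).
import numpy as np

def select_windows(feature_map, window=4, keep_ratio=0.25):
    """Partition a (H, W) feature map into window x window tiles and keep
    the top fraction by mean activation; returns (row, col) tile indices."""
    H, W = feature_map.shape
    rows, cols = H // window, W // window
    # Mean activation per tile as a cheap objectness proxy.
    scores = feature_map[:rows * window, :cols * window] \
        .reshape(rows, window, cols, window).mean(axis=(1, 3))
    k = max(1, int(rows * cols * keep_ratio))
    flat = scores.ravel().argsort()[::-1][:k]
    return [(int(i // cols), int(i % cols)) for i in flat]
```

Only the selected windows would then be processed by the heavy attention layers, which is where the claimed efficiency gain on sparse HRW scenes would come from.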
Wenxi Li
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Yuchen Guo
Tsinghua University
Machine Learning · Computer Vision · Information Retrieval
Jilai Zheng
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Haozhe Lin
Tsinghua University
Chao Ma
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Lu Fang
Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China