🤖 AI Summary
DETR achieves end-to-end object detection but suffers from redundant competition among the learnable queries in its decoder. To address this, we propose an adaptive pairwise query routing mechanism with dual asymmetric routing paths, suppression and delegation, guided by inter-query similarity, confidence scores, and geometric relationships. We further enhance self-attention modeling with a learnable low-rank attention bias and adopt a dual-branch training strategy to optimize routing decisions. Crucially, the method incurs zero inference overhead: no additional computation is required during deployment. On COCO, our approach improves the ResNet-50-based DINO baseline by +1.7% mAP; on Cityscapes, it achieves 57.6% mAP with a Swin-L backbone, surpassing the prior state of the art. Our core contribution is the first structured query-routing framework explicitly designed to mitigate query competition, achieving simultaneous gains in efficiency and accuracy.
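To make the pairwise routing decision concrete, here is a minimal, hypothetical PyTorch sketch of how competing and complementary query pairs might be labeled from the three signals named above (feature similarity, confidence, and geometric overlap). The function name `route_queries` and the thresholds `iou_thresh` and `sim_thresh` are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of the pairwise routing decision; names and
# thresholds are assumptions for illustration, not the paper's code.
import torch
from torchvision.ops import box_iou

def route_queries(feats, boxes, scores, iou_thresh=0.5, sim_thresh=0.8):
    """Label each ordered query pair: suppressed (-1), delegated (+1), neutral (0).

    feats:  (N, D) decoder query embeddings
    boxes:  (N, 4) predicted boxes in (x1, y1, x2, y2) format
    scores: (N,)   per-query confidence
    """
    # Inter-query feature similarity, shape (N, N).
    sim = torch.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
    # Geometric relationship: pairwise IoU of predicted boxes, shape (N, N).
    iou = box_iou(boxes, boxes)
    competing = (iou > iou_thresh) & (sim > sim_thresh)
    # Asymmetry via confidence: only the lower-scoring member of a competing
    # pair is suppressed, so routes[i, j] may differ from routes[j, i].
    lower_conf = scores.unsqueeze(1) < scores.unsqueeze(0)
    off_diag = ~torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    routes = torch.zeros_like(sim)
    routes[competing & lower_conf] = -1.0   # suppression path
    routes[(~competing) & off_diag] = 1.0   # delegation path
    return routes
```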
📝 Abstract
Detection Transformer (DETR) offers an end-to-end solution for object detection by eliminating hand-crafted components such as non-maximum suppression. However, DETR suffers from inefficient query competition, where multiple queries converge to similar positions and produce redundant computation. We present Route-DETR, which addresses this issue through adaptive pairwise routing in the decoder's self-attention layers. Our key insight is to distinguish competing queries (targeting the same object) from complementary queries (targeting different objects) using inter-query similarity, confidence scores, and geometric relationships. We introduce dual routing mechanisms: suppressor routes, which modulate attention between competing queries to reduce duplication, and delegator routes, which encourage exploration of different regions. Both are implemented as learnable low-rank attention biases that enable asymmetric query interactions. A dual-branch training strategy applies the routing biases only during training while preserving standard attention at inference, ensuring no additional computational cost. Experiments on COCO and Cityscapes demonstrate consistent improvements across multiple DETR baselines, including a +1.7% mAP gain over DINO with a ResNet-50 backbone and 57.6% mAP with Swin-L, surpassing prior state-of-the-art models.
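The low-rank bias and the train-only routing branch can be sketched briefly. Below is a minimal PyTorch illustration, assuming an asymmetric bias B = (QU)(KV)ᵀ added to the self-attention logits only while training, with the plain attention path used at inference; the class name, the rank, and the `self.training` toggle are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a learnable low-rank
# attention bias used only in the training branch of decoder self-attention.
import torch
import torch.nn as nn

class RoutedSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, rank=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Low-rank factors U and V give a bias B = (qU)(kV)^T; since U != V,
        # B[i, j] != B[j, i] in general, enabling one-directional routing.
        self.U = nn.Linear(dim, rank, bias=False)
        self.V = nn.Linear(dim, rank, bias=False)

    def forward(self, queries):  # queries: (B, N, D)
        if self.training:
            # Training branch: inject the routing bias into attention logits.
            bias = self.U(queries) @ self.V(queries).transpose(-2, -1)  # (B, N, N)
            mask = bias.repeat_interleave(self.attn.num_heads, dim=0)   # (B*h, N, N)
            out, _ = self.attn(queries, queries, queries, attn_mask=mask)
            return out
        # Inference branch: standard self-attention, zero extra cost.
        out, _ = self.attn(queries, queries, queries)
        return out
```

Because the bias lives only in the training branch, the deployed model is architecturally identical to the baseline decoder, which is how the zero-inference-overhead claim is realized.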