DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Query-based Transformer detectors suffer from three key limitations: (1) a fixed number of learnable queries restricts adaptability to varying object counts; (2) Recurrent Opposing inTeractions (ROT) between self-attention and cross-attention degrade decoding efficiency; and (3) shared-weight decoder layers jointly handle one-to-many localization and one-to-one deduplication, causing "query ambiguity" that violates DETR's bipartite matching principle. To address these, we propose DS-Det: a detector employing a single-query paradigm for adaptive object count estimation; decoupling attention mechanisms—cross-attention for one-to-many localization and self-attention for one-to-one deduplication; and introducing the PoCoo loss, which incorporates box-size priors to enhance small-object learning. Extensive experiments on COCO2017 and WiderPerson with five backbone architectures demonstrate consistent and significant improvements over baseline methods, validating DS-Det's generality and state-of-the-art performance.

📝 Abstract
Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed queries is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are trained with both one-to-one and one-to-many label assignments, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming fixed queries into flexible ones. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (a one-to-many process) and deduplicating predictions with Self-Attention (a one-to-one process), which addresses the "query ambiguity" and ROT issues directly and enhances decoder efficiency. We further introduce a unified PoCoo loss that leverages box-size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on the COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
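The disentangled decoder flow described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration of the *order of operations* only (cross-attention for one-to-many localization, then self-attention for one-to-one deduplication), under our own assumptions: the paper's actual layers include learned projections, positional encodings, and feed-forward blocks, all omitted here, and the function names are ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def disentangled_decoder_layer(queries, encoder_tokens):
    """Illustrative decoder layer in the spirit of DS-Det's design:
    cross-attention gathers image evidence to locate boxes (one-to-many),
    then self-attention lets queries interact to suppress duplicate
    predictions (one-to-one). Not the paper's implementation."""
    located = attention(queries, encoder_tokens, encoder_tokens)  # one-to-many localization
    deduped = attention(located, located, located)                # one-to-one deduplication
    return deduped
```

Separating the two attention roles into distinct steps, rather than mixing them under shared weights, is exactly what the abstract credits with resolving the ROT and "query ambiguity" issues.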
Problem

Research questions and friction points this paper is trying to address.

Addresses fixed-query limitations in transformer detectors
Resolves Recurrent Opposing Interactions in attention mechanisms
Eliminates query ambiguity in shared-weight decoder layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-Query paradigm for flexible detection
Attention disentangled learning for efficiency
Unified PoCoo loss prioritizes hard samples
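The exact PoCoo formulation is given in the paper, not in this summary; the sketch below is only an illustrative guess at how a box-size prior could up-weight small objects. Every name and the `gamma` exponent are our assumptions.

```python
import numpy as np

def size_prior_weight(box_wh, image_wh, gamma=2.0):
    """Hypothetical size-prior weight: smaller boxes (relative to the
    image) get weights closer to 1, larger boxes get smaller weights,
    so hard small-object samples contribute more to the loss.
    `gamma` is an assumed focusing exponent, not from the paper."""
    rel = np.sqrt((box_wh[:, 0] * box_wh[:, 1]) / (image_wh[0] * image_wh[1]))
    return (1.0 - rel) ** gamma

def weighted_l1_loss(pred, target, box_wh, image_wh):
    # Apply the per-box prior weight to a plain L1 regression loss.
    w = size_prior_weight(box_wh, image_wh)
    return float((w[:, None] * np.abs(pred - target)).mean())
```

Under this sketch, a 10x10 box in a 640x640 image receives a much larger weight than a 320x320 box, matching the stated goal of prioritizing small-object learning.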