Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the long-tailed distribution and zero-shot generalization challenges in human-object interaction detection (HOID) that arise from sparse annotations, this paper proposes a top-down zero-shot HOID framework: it first localizes objects and then dynamically associates actions with them via multimodal cues (visual, linguistic, and relational). HOI-specific priors are injected at the encoder stage; an asymmetric co-attention mechanism models object-action dependencies; and a relation-aware loss function enforces semantic consistency. The method adopts an end-to-end Transformer architecture that unifies zero-shot learning with structured relational modeling. Evaluated on HICO-DET and V-COCO across fully-supervised and six zero-shot settings, it achieves state-of-the-art performance, with gains of up to 12.4% for unseen and 8.4% for rare HOI categories, demonstrating substantially better scene understanding and generalization under data-scarce conditions.

📝 Abstract
Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. Since there could be an exponential number of object-action combinations, labeled data is limited, leading to a long-tail distribution problem. Recently, zero-shot learning emerged as a solution, with end-to-end transformer-based object detectors adapted for HOID becoming successful frameworks. However, their primary focus is designing improved decoders for learning entangled or disentangled interpretations of interactions. We advocate that HOI-specific cues must be anticipated at the encoder stage itself to obtain a stronger scene interpretation. Consequently, we build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. We first probe an image for the presence of objects (well-defined concepts) and then probe for actions (abstract concepts) associated with them. A novel asymmetric co-attention mechanism mines these cues utilizing multimodal information (incorporating zero-shot capabilities) and yields stronger interaction representations at the encoder level. Furthermore, a novel loss is devised that considers object-action relatedness and regulates misclassification penalty better than existing loss functions for guiding the interaction classifier. Extensive experiments on the HICO-DET and V-COCO datasets across fully-supervised and six zero-shot settings reveal our state-of-the-art performance, with up to 12.4% and 8.4% gains for unseen and rare HOI categories, respectively.
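The abstract does not spell out the co-attention formula, but the "objects first, actions second" asymmetry it describes can be illustrated with a toy sketch. Everything below (function names, the pooled-summary/gating design, shapes) is our own illustrative assumption, not the paper's actual mechanism: object embeddings attend densely over action embeddings, while action embeddings only receive a gated, pooled object summary, so the two directions are deliberately unequal.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def asymmetric_co_attention(obj_feats, act_feats):
    """Toy asymmetric co-attention over object/action embeddings
    (illustrative only; not the paper's exact mechanism).
    obj_feats: (N_o, d) well-defined concepts; act_feats: (N_a, d) abstract concepts.
    """
    d = obj_feats.shape[-1]
    # Direction 1: each object queries all actions (dense attention map).
    attn_o2a = softmax(obj_feats @ act_feats.T / np.sqrt(d), axis=-1)  # (N_o, N_a)
    obj_ctx = obj_feats + attn_o2a @ act_feats      # action-aware object features
    # Direction 2 (weaker, hence "asymmetric"): actions only see a pooled
    # object summary, admitted through a scalar gate per action.
    obj_summary = obj_feats.mean(axis=0, keepdims=True)       # (1, d)
    gate = sigmoid(act_feats @ obj_summary.T / np.sqrt(d))    # (N_a, 1)
    act_ctx = act_feats + gate * obj_summary                  # object-aware action features
    return obj_ctx, act_ctx, attn_o2a
```

The object-to-action direction keeps a full attention map, while the reverse direction is bottlenecked through a single pooled vector; this mirrors the top-down intuition of grounding actions on already-detected objects rather than the other way around.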
Problem

Research questions and friction points this paper is trying to address.

Detecting human-object interactions with zero-shot learning
Addressing long-tail distribution in HOI detection data
Enhancing encoder-level interaction representations for better scene interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-down framework for HOI detection
Asymmetric co-attention for multimodal cues
Relation-aware loss that regulates the misclassification penalty
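The page does not give the loss formula, but the idea of regulating the misclassification penalty via object-action relatedness can be sketched as a weighted binary cross-entropy. The function name, the `alpha` scaling, and the linear weighting below are our own assumptions, not the paper's definition: negative classes that are semantically close to the ground-truth pair are penalized less than unrelated ones.

```python
import numpy as np

def relation_aware_bce(logits, targets, relatedness, alpha=0.5, eps=1e-9):
    """Illustrative relatedness-regulated BCE (assumed form, not the
    paper's exact loss). logits, targets, relatedness: shape (C,),
    with relatedness in [0, 1] measuring how close each class is to
    the ground-truth object-action pair."""
    p = 1.0 / (1.0 + np.exp(-logits))
    # Positive classes: standard log-likelihood term, unmodified.
    pos = -targets * np.log(p + eps)
    # Negative classes: the penalty shrinks as relatedness grows, so a
    # "near-miss" prediction is punished less than an unrelated one.
    neg = -(1.0 - targets) * (1.0 - alpha * relatedness) * np.log(1.0 - p + eps)
    return float((pos + neg).mean())
```

With this weighting, confusing "ride horse" with the related "walk horse" would cost less than confusing it with an unrelated interaction, which is one plausible way to read "regulates misclassification penalty".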