Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation

📅 2025-11-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing open-vocabulary scene graph generation (OVSGG) methods lack explicit interaction modeling, so they struggle to tell interacting instances of an object category apart from non-interacting instances of the same category. This produces noisy pseudo-supervision during knowledge infusion and ambiguous query matching during knowledge transfer. To address this, the paper proposes ACC, an end-to-end framework built on an *interaction-centric paradigm*. ACC generates more robust pseudo-labels via bidirectional interaction prompting, uses interaction-guided query selection to prioritize pairing interacting objects, and applies interaction-consistent knowledge distillation. By combining large-scale pretrained vision-language models with end-to-end training, ACC reduces interaction ambiguity among instances of the same object category. Evaluated on three benchmarks, ACC achieves state-of-the-art performance, which suggests that explicit interaction modeling improves fine-grained relational understanding and open-vocabulary generalization.

📝 Abstract

Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) *Infusing knowledge* into large-scale models via pre-training on large datasets; 2) *Transferring knowledge* from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For *interaction-centric knowledge infusion*, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For *interaction-centric knowledge transfer*, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.

Problem

Research questions and friction points this paper is trying to address.

Distinguishing interacting versus non-interacting object instances in scene graphs

Reducing noisy pseudo-supervision from mismatched objects during knowledge infusion

Resolving ambiguous query matching during knowledge transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interaction-centric framework for open-vocabulary scene graphs

Bidirectional interaction prompts for robust pseudo-supervision

Interaction-guided query selection and knowledge distillation
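
To make the idea of prioritizing interacting objects concrete, here is a toy sketch in Python. It is not the paper's implementation: the function name `select_pairs`, the per-object interaction scores, and the product-based pair ranking are all assumptions chosen for illustration. It only shows how ranking subject-object pairs by an interaction score can keep a non-interacting instance of the same category out of the top candidates.

```python
from itertools import permutations


def select_pairs(interaction_scores, k):
    """Return the k ordered (subject, object) index pairs with the highest joint score.

    interaction_scores: one hypothetical score in [0, 1] per detected object,
    where a higher value means the object is more likely to take part in a relation.
    The joint score of a pair is assumed here to be the product of both scores.
    """
    pairs = permutations(range(len(interaction_scores)), 2)
    scored = [
        (interaction_scores[s] * interaction_scores[o], s, o) for s, o in pairs
    ]
    # Sort by descending joint score; break ties by index for a deterministic order.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(s, o) for _, s, o in scored[:k]]


# Made-up scene: object 0 is a person riding a horse, object 1 is the horse,
# object 2 is a second person standing idle in the background.
scores = [0.9, 0.8, 0.1]
top_pairs = select_pairs(scores, k=2)
print(top_pairs)
```

In this made-up scene the two selected pairs involve only the rider and the horse, so the idle person, although it shares a category with the rider, never enters the candidate set.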

🔎 Similar Papers

BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation