🤖 AI Summary
Current visual relationship detection (VRD) models are constrained by fixed predicate vocabularies and struggle to generalize to unseen, unlabeled relationships. To address this, we propose a novel open-world VRD paradigm: leveraging large language models (LLMs) to generate structured relational priors, integrated with an iterative visual grounding mechanism that jointly optimizes scene graph hypotheses and perceptual evidence. We introduce the first EM-style “hallucination–grounding–refinement” framework, enabling zero-shot predicate generalization. We further construct the first Visual Genome-based open-world VRD benchmark, featuring 21 held-out predicates for evaluation. Our method employs co-training of LLMs and vision encoders, coupled with multi-stage scene graph alignment. Under the seen, unseen, and mixed settings, it achieves mR@50 scores of 15.9, 13.1, and 11.7, respectively, substantially outperforming LLM-only, few-shot, and debiased baselines.
📝 Abstract
Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible but unannotated relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 for predicate classification under these three settings, respectively. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
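The EM-style hallucination–grounding–refinement loop described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: all function names, the random proposal step (standing in for an LLM's hypotheses), the grounding threshold, and the re-weighting rule are assumptions introduced for clarity.

```python
import random

def propose_scene_graphs(objects, predicates, k=5, rng=None):
    """E-step stand-in: an LLM would hallucinate candidate
    (subject, predicate, object) triples from detected objects;
    here we simply sample uniformly at random."""
    rng = rng or random.Random(0)
    graphs = []
    for _ in range(k):
        subj, obj = rng.sample(objects, 2)
        graphs.append((subj, rng.choice(predicates), obj))
    return graphs

def ground_and_refine(objects, predicates, visual_scores, iters=3):
    """M-step stand-in: keep only hypotheses the 'visual model'
    supports (score above a threshold) and re-weight predicates
    toward grounded evidence across iterations."""
    rng = random.Random(0)
    weights = {p: 1.0 for p in predicates}   # relational prior weights
    accepted = []
    for _ in range(iters):
        candidates = propose_scene_graphs(objects, predicates, rng=rng)
        for triple in candidates:
            score = visual_scores.get(triple, 0.0)
            if score > 0.5:                  # grounding threshold (assumed)
                accepted.append(triple)
                weights[triple[1]] += score  # refinement: boost this predicate
    return accepted, weights
```

A real system would replace `propose_scene_graphs` with LLM prompting over detected object labels and `visual_scores` with a trained vision encoder's alignment scores; the loop structure (propose, ground, refine the prior) is the part that mirrors the paper's description.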