🤖 AI Summary
Existing scene graph generation models struggle to learn reliable visual commonsense under sparse annotations, leading to significantly degraded performance on rare relationships. This work proposes a model-agnostic, semantics-guided knowledge refinement framework that automatically uncovers spatial, functional, and qualitative relational patterns from training data during inference and dynamically corrects predictions through declarative commonsense reasoning. Requiring neither handcrafted rules nor model retraining, the method is readily transferable across datasets and architectures, and represents the first effective integration of structured visual commonsense reasoning into purely learning-driven scene graph generation pipelines. It consistently outperforms strong baselines across three standard benchmarks, demonstrating the critical role of visual commonsense in enhancing the robustness and accuracy of scene graph generation.
📝 Abstract
Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.