🤖 AI Summary
To address inconsistent predictions, missed detections, and misclassifications in static first-person object detection—caused by neglecting spatial layout priors—this paper proposes a graph neural network (GNN)-based post-processing framework. It introduces graph-structured modeling for egocentric spatial context for the first time: objects serve as nodes, while geometric and semantic proximity relations define edges, forming a scene graph; GNNs then aggregate neighborhood information to automatically identify and rectify detection anomalies. The method is detector-agnostic, compatible with mainstream detectors such as YOLOv7 and RT-DETR, and requires only minimal human annotation for training. On standard benchmarks, it achieves up to a 4.0% improvement in mAP@50, significantly enhancing detection consistency and robustness under occlusion and cluttered scenes. Key contributions include: (1) the first spatial relational graph modeling paradigm tailored for egocentric vision, and (2) a plug-and-play detection refinement approach that requires no detector retraining.
📝 Abstract
In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment's spatial structure to improve reliability in object detection systems.