🤖 AI Summary
To address the severe performance degradation of object detectors under occlusion, this paper proposes a robust detection framework built on scene context modeling. It introduces two interpretable information fusion mechanisms: dynamic network selection before prediction and scene-aware score fusion after prediction. Both are coupled with a scene classification network and an RPN-DCNN backbone, enabling cross-dataset transfer without large-scale scene-level annotations. The key innovation is the explicit incorporation of scene context into both the model architecture and the post-processing stage, balancing structural adaptability with interpretability. Experiments on challenging partially occluded datasets demonstrate substantial improvements in both recall and precision, and joint training on a mix of occluded and unoccluded images consistently outperforms single-scenario training. This work provides a novel conceptual framework and a practical technical pathway toward occlusion-robust object detection.
📝 Abstract
The presence of occlusions poses substantial challenges to typically powerful object recognition algorithms, and additional sources of information can be extremely valuable for reducing the errors occlusions cause. Scene context is known to aid object recognition in biological vision. In this work, we add robustness to existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network based on the identified background scene; the second operates after detection, fusing scene knowledge into the initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, showing overall improvements in both recall and precision over baseline methods. In addition, our experiments compare multiple training methodologies for occlusion handling, finding that training on a combination of occluded and unoccluded images outperforms training on either alone. Our method is interpretable and easily adapted to other datasets, offering many directions for future research and practical applications.
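The second fusion technique described above (adjusting detector scores with scene knowledge after detection) can be illustrated with a minimal sketch. This is not the paper's implementation: the scene labels, class names, prior values, and the linear blending weight `alpha` are all hypothetical assumptions chosen for illustration.

```python
# Hypothetical sketch of scene-aware score fusion: blend each detection's
# confidence with a scene-conditional class prior. All values are assumed
# for illustration, not taken from the paper.
SCENE_PRIORS = {
    "road":   {"car": 0.6, "pedestrian": 0.3, "boat": 0.1},
    "harbor": {"car": 0.1, "pedestrian": 0.2, "boat": 0.7},
}

def fuse_scores(detections, scene, alpha=0.5):
    """Blend detector scores with scene priors.

    detections: list of (class_name, score) pairs from the object detector.
    scene:      scene label predicted by a scene classification network.
    alpha:      weight on the original detector score; (1 - alpha) weights
                the scene prior. Returns pairs sorted by fused score.
    """
    fused = []
    for cls, score in detections:
        prior = SCENE_PRIORS[scene].get(cls, 0.0)
        fused.append((cls, alpha * score + (1 - alpha) * prior))
    return sorted(fused, key=lambda d: d[1], reverse=True)

# Two equally confident detections: in a harbor scene, "boat" is boosted
# above "car" once the scene prior is fused in.
print(fuse_scores([("car", 0.5), ("boat", 0.5)], scene="harbor"))
```

The fusion rule here is a simple convex combination; the point is that scene knowledge re-ranks ambiguous detections without retraining the detector, which is what makes post-prediction fusion attractive when objects are partially occluded.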