🤖 AI Summary
In real-world deployments, the absence of ground-truth annotations impedes reliable monitoring and comparative evaluation of object detection models. To address this, we propose the Cumulative Consensus Score (CCS), a label-free, model-agnostic online evaluation metric. CCS generates multiple augmented views of each test sample, computes spatial consensus among predicted bounding boxes via IoU-based overlap analysis, normalizes scores using the maximum pairwise overlap, and accumulates the resulting reliability estimates across detections. It enables fine-grained, scene-level performance assessment and is compatible with both one-stage and two-stage detectors, supporting DevOps-style continuous monitoring. Experiments on Open Images and KITTI demonstrate that CCS achieves over 90% correlation with supervised metrics, including F1-score, Probabilistic Detection Quality (PDQ), and Optimal Correction Cost (OCC), while reliably identifying low-performance scenes. CCS exhibits strong robustness to annotation noise and practical utility in production environments.
📝 Abstract
Evaluating object detection models in deployment is challenging because ground-truth annotations are rarely available. We introduce the Cumulative Consensus Score (CCS), a label-free metric that enables continuous monitoring and comparison of detectors in real-world settings. CCS applies test-time data augmentation to each image, collects predicted bounding boxes across augmented views, and computes overlaps using Intersection over Union. Maximum overlaps are normalized and averaged across augmentation pairs, yielding a measure of spatial consistency that serves as a proxy for reliability without annotations. In controlled experiments on Open Images and KITTI, CCS achieved over 90% agreement with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost. The method is model-agnostic, working across single-stage and two-stage detectors, and operates at the case level to highlight under-performing scenarios. Altogether, CCS provides a robust foundation for DevOps-style monitoring of object detectors.
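The pipeline described above (augmented views, IoU overlaps, maximum pairwise overlap, averaging) can be sketched in a few lines. This is a hypothetical minimal implementation for illustration only, not the paper's exact formulation; the function names (`iou`, `consensus_score`) and the assumption that predicted boxes from each augmented view have already been mapped back to the original image frame are ours:

```python
# Minimal sketch of a consensus score in the spirit of CCS.
# Assumes axis-aligned boxes (x1, y1, x2, y2), already transformed
# back into the original image's coordinate frame.
from itertools import combinations

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def consensus_score(views):
    """views: list of per-augmentation box lists for one image.
    For every pair of views, each box's best (maximum) IoU match in
    the other view is taken; the mean over all boxes and pairs is a
    spatial-consistency score in [0, 1]."""
    scores = []
    for boxes_a, boxes_b in combinations(views, 2):
        for box in boxes_a:
            if boxes_b:
                scores.append(max(iou(box, other) for other in boxes_b))
            else:
                scores.append(0.0)  # no detection to agree with
    return sum(scores) / len(scores) if scores else 0.0
```

A detector whose predictions are stable under augmentation yields scores near 1 (identical boxes across views), while unstable or spurious detections pull the score toward 0, which is what lets the metric flag under-performing scenes without labels.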