🤖 AI Summary
Visual explanation methods such as CAM often lack structural faithfulness: they fail to place saliency precisely on the essential substructures while keeping it off the background. To address this, we introduce CAMBench-QR, a benchmark grounded in the canonical geometry of QR codes (finder patterns, timing lines, module grid), with a structure-aware evaluation framework that combines causal occlusion with insertion/deletion faithfulness analysis. QR code geometric priors serve as the basis for quantitative interpretability assessment. The benchmark evaluates efficient CAM variants (LayerCAM, EigenGrad-CAM, XGrad-CAM) on structural sensitivity, robustness to controlled distortions, and inference latency, using synthetically generated QR/non-QR data, exact ground-truth masks, and structure-aware metrics, including a distance-to-structure measure. It covers both zero-shot and last-block fine-tuned regimes, and provides reproducible metrics and training recipes. We propose it as a “litmus test” for whether visual explanations are truly structure-aware.
📝 Abstract
Visual explanations are often plausible but not structurally faithful. We introduce CAMBench-QR, a structure-aware benchmark that leverages the canonical geometry of QR codes (finder patterns, timing lines, module grid) to test whether CAM methods place saliency on requisite substructures while avoiding background. CAMBench-QR synthesizes QR/non-QR data with exact masks and controlled distortions, and reports structure-aware metrics (Finder/Timing Mass Ratios, Background Leakage, coverage AUCs, Distance-to-Structure) alongside causal occlusion, insertion/deletion faithfulness, robustness, and latency. We benchmark representative, efficient CAMs (LayerCAM, EigenGrad-CAM, XGrad-CAM) under two practical regimes: zero-shot and last-block fine-tuning. The benchmark, metrics, and training recipes provide a simple, reproducible yardstick for structure-aware evaluation of visual explanations. We therefore propose CAMBench-QR as a litmus test of whether visual explanations are truly structure-aware.
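The abstract names the structure-aware metrics but does not define them. As a rough illustration only, the sketch below assumes that a mass ratio is the share of non-negative saliency falling inside a binary structure mask, and that background leakage is the share falling outside it. The function names `mass_ratio` and `background_leakage`, the clipping of negative values, and the toy 8x8 data are all hypothetical and may differ from the benchmark's actual definitions.

```python
import numpy as np


def mass_ratio(cam, mask, eps=1e-12):
    """Share of non-negative saliency mass that lies inside `mask`.

    Assumed definition for illustration; not taken from the paper.
    """
    cam = np.clip(np.asarray(cam, dtype=float), 0.0, None)  # drop negative saliency
    mask = np.asarray(mask, dtype=bool)
    total = cam.sum()
    if total <= eps:  # empty map: no mass to attribute anywhere
        return 0.0
    return float(cam[mask].sum() / total)


def background_leakage(cam, structure_mask):
    """Share of saliency mass outside the structure mask (assumed definition)."""
    return mass_ratio(cam, ~np.asarray(structure_mask, dtype=bool))


# Toy 8x8 saliency map: a strong blob on a hypothetical finder-pattern
# region (top-left) and a weaker blob on the background (bottom-right).
cam = np.zeros((8, 8))
cam[:2, :2] = 1.0
cam[6:, 6:] = 0.5

finder_mask = np.zeros((8, 8), dtype=bool)
finder_mask[:2, :2] = True

finder_ratio = mass_ratio(cam, finder_mask)
leakage = background_leakage(cam, finder_mask)
print(f"finder mass ratio: {finder_ratio:.3f}")
print(f"background leakage: {leakage:.3f}")
```

Under these assumed definitions the two numbers sum to one for any non-empty map, so a structure-aware explanation would push the mass ratio up and the leakage down.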