🤖 AI Summary
This work addresses a safety gap in test-time scaling for vision-language-action (VLA) policies: methods that sample K candidate actions and execute the verifier-best one have no reliable abstention mechanism, so they execute an unsafe action whenever all K candidates are hazardous. To remedy this, we propose BOKBO, the first conformal abstention layer for K-sample VLA inference, which provides finite-sample, distribution-free guarantees on the executed-violation rate. BOKBO comes in a global variant and a per-task (Mondrian) variant, and calibrates on a learned violation predictor conditioned on semantic visual features and task identity. The analysis shows that policy-internal nonconformity scores fail structurally under perturbation-based K-sampling, and a replication under token-level temperature sampling shows that the failure is mechanism-specific. The work also corrects a force-threshold pitfall that inflates violation rates. At ε = 0.05 on libero_object_temp_x0.1, BOKBO achieves 78% coverage and 70% net task success, and the Mondrian variant raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across training seeds, base policies, benchmarks, and distribution shifts.
📝 Abstract
Test-time scaling methods for vision-language-action (VLA) policies, such as RoboMonkey, SEAL, MG-Select, and V-GPS, sample K candidate action chunks at inference and execute the verifier-best one. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample, distribution-free guarantees on the executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks.
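The abstract does not spell out the calibration procedure, so the following is only a minimal sketch of how an abstention threshold of this kind could be chosen. It assumes the standard conformal risk control rule for a loss bounded by 1, applied to the executed-violation loss (1 if the step is executed and violates, 0 otherwise). The function names `calibrate_threshold`, `calibrate_per_task`, and `should_execute` are hypothetical, as is the convention that a lower score means a safer step.

```python
import numpy as np

def calibrate_threshold(scores, violations, eps):
    """Pick the largest threshold tau such that executing every calibration
    step with score <= tau keeps the conformal risk control bound
    (executed_violations + 1) / (n + 1) at or below eps.

    Assumption: `scores` are predicted violation scores (lower = safer) and
    `violations` are 0/1 labels. Returns -inf when no threshold is certified,
    which means "abstain on everything".
    """
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    n = scores.size
    tau = -np.inf
    for c in np.unique(scores):  # candidate thresholds in ascending order
        executed_violations = violations[scores <= c].sum()
        if (executed_violations + 1.0) / (n + 1.0) <= eps:
            tau = c
        else:
            break  # the bound is nondecreasing in c, so stop at first failure
    return tau

def calibrate_per_task(scores, violations, task_ids, eps):
    """Mondrian-style variant: one threshold per task, each calibrated only
    on that task's calibration data."""
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    task_ids = np.asarray(task_ids)
    return {
        t.item(): calibrate_threshold(
            scores[task_ids == t], violations[task_ids == t], eps
        )
        for t in np.unique(task_ids)
    }

def should_execute(score, tau):
    """Execute the verifier-best candidate only if its score clears tau;
    otherwise abstain."""
    return score <= tau
```

Under this rule, a task with very few calibration samples cannot certify any threshold at small `eps`, so the per-task variant trades coverage for a guarantee that holds within each task rather than only on average.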
Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $σ$, while correlating only at the noise floor with actual safety violations. We test the scope of this failure by replicating the analysis under token-level temperature sampling and find that it is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $ε = 0.05$ on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $π_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.
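The score diagnostic described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: `k_sample_disagreement` and `score_diagnostics` are hypothetical helpers, disagreement is taken to be the mean per-dimension standard deviation across the K candidates, and the data are purely synthetic (Gaussian noise scaled by $σ$, with violation labels drawn independently of $σ$). It shows why a disagreement score can track the noise scale while carrying no information about violations; it does not reproduce the reported 0.98 figure.

```python
import numpy as np

def k_sample_disagreement(candidates):
    """Disagreement among K candidate actions, given as an array of shape
    (K, D): the per-dimension standard deviation, averaged over dimensions."""
    return np.asarray(candidates, dtype=float).std(axis=0).mean()

def score_diagnostics(score, sigma, violation):
    """Pearson correlation of a candidate nonconformity score with
    (a) the action-noise scale and (b) the 0/1 violation labels."""
    score = np.asarray(score, dtype=float)
    r_sigma = np.corrcoef(score, np.asarray(sigma, dtype=float))[0, 1]
    r_viol = np.corrcoef(score, np.asarray(violation, dtype=float))[0, 1]
    return r_sigma, r_viol

# Synthetic illustration only: perturbation-based sampling adds noise of
# scale sigma around a base action, so the spread of the K candidates is
# driven by sigma by construction.
rng = np.random.default_rng(0)
n, K, D = 300, 16, 7
sigma = rng.uniform(0.01, 0.5, size=n)
candidates = sigma[:, None, None] * rng.standard_normal((n, K, D))
disagreement = np.array([k_sample_disagreement(c) for c in candidates])
violation = rng.integers(0, 2, size=n)  # labels independent of sigma

r_sigma, r_viol = score_diagnostics(disagreement, sigma, violation)
print(f"corr(score, sigma) = {r_sigma:.2f}, corr(score, violation) = {r_viol:.2f}")
```

A score that passes this check on real rollouts would show the opposite pattern: weak dependence on the sampling hyperparameter and a clear correlation with violation labels.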