🤖 AI Summary
This study systematically evaluates the effectiveness and robustness of existing language model deception detectors in realistic settings, particularly their resilience against simple adversarial evasion strategies.
Method: We use linear classifiers ("deception probes") on a language model's internal activations and compare detection performance under black-box (output-only) and white-box (probe-activation access) monitoring paradigms. We propose the "black-to-white performance boost" as the core evaluation metric, quantifying the improvement in detection attributable to access to internal information.
Results: Although current detectors exhibit weak overall performance, existing deception probes yield consistently positive black-to-white boosts, suggesting that internal activations carry usable deception signal beyond what output-only monitoring captures. The boost provides a reproducible, comparable benchmark for detector design, though detection performance in realistic settings remains limited.
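The "deception probe" idea above can be illustrated with a minimal sketch: a logistic-regression classifier trained to separate activation vectors from honest versus deceptive responses. The activations below are synthetic stand-ins (random vectors with a hypothetical "deception direction" added), not real model activations; in practice they would be residual-stream activations extracted from a language model.

```python
# Minimal sketch of a linear deception probe on (synthetic) activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64       # hypothetical hidden dimension
n_per_class = 200

# Synthetic activations: deceptive examples are shifted along a
# hypothetical "deception direction" so the classes are separable.
direction = rng.normal(size=d_model)
honest = rng.normal(size=(n_per_class, d_model))
deceptive = rng.normal(size=(n_per_class, d_model)) + 0.5 * direction

X = np.vstack([honest, deceptive])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 1 = deceptive

probe = LogisticRegression(max_iter=1000).fit(X, y)
# Per-example probe score: probability the activation came from a
# deceptive response. Token-level scores work the same way, one
# activation vector per token.
scores = probe.predict_proba(X)[:, 1]
print(f"train accuracy: {probe.score(X, y):.2f}")
```

A white-box monitor would consume these per-token probe scores alongside the model's output; a black-box monitor sees the output alone.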
📝 Abstract
AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it is unclear how effective these probes are at detecting deception in practice, or whether they are resistant to simple counter-strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white-box monitor outperforms the black-box monitor, i.e., the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
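The black-to-white performance boost described above can be sketched as a simple difference in detection quality (here AUROC) between the two monitors. All scores below are synthetic stand-ins: the white-box monitor is simply modeled as better correlated with the true label than the black-box one, which is an assumption for illustration, not a result.

```python
# Sketch of the black-to-white performance boost: the gain in AUROC when
# the monitor also sees probe activations (white-box) versus only the
# model's output (black-box). Scores are synthetic, not real monitors.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, size=n)  # 1 = deceptive response

# Hypothetical monitor scores: both correlate with the label, but the
# black-box monitor is noisier (larger scale).
black_box_scores = labels + rng.normal(scale=2.0, size=n)
white_box_scores = labels + rng.normal(scale=1.0, size=n)

boost = (roc_auc_score(labels, white_box_scores)
         - roc_auc_score(labels, black_box_scores))
print(f"black-to-white performance boost: {boost:+.3f}")
```

A positive boost indicates that access to token-level probe activations improved detection over output-only monitoring, which is the quantity the paper uses to benchmark probes.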