🤖 AI Summary
This study systematically evaluates the effectiveness and robustness of existing language model deception detectors in realistic settings, particularly their resilience against simple adversarial evasion strategies.
Method: We use linear classifiers ("deception probes") on a language model's internal activations and compare detection performance under black-box (output-only) and white-box (probe-activation access) monitoring paradigms. We propose the "black-to-white performance boost" as the core evaluation metric, quantifying the improvement in detection attributable to access to internal information.
Results: Although current detectors exhibit weak overall performance, existing deception probes yield consistently positive black-to-white boosts, suggesting that internal activations carry usable deception signal beyond what output-only monitoring captures. The boost provides a reproducible, comparable benchmark for detector design, though detection performance in realistic settings remains limited.
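The "deception probe" idea above can be illustrated with a minimal sketch: a logistic-regression classifier trained to separate activation vectors from honest versus deceptive responses. The activations below are synthetic stand-ins (random vectors with a hypothetical "deception direction" added), not real model activations; in practice they would be residual-stream activations extracted from a language model.

```python
# Minimal sketch of a linear deception probe on (synthetic) activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64       # hypothetical hidden dimension
n_per_class = 200

# Synthetic activations: deceptive examples are shifted along a
# hypothetical "deception direction" so the classes are separable.
direction = rng.normal(size=d_model)
honest = rng.normal(size=(n_per_class, d_model))
deceptive = rng.normal(size=(n_per_class, d_model)) + 0.5 * direction

X = np.vstack([honest, deceptive])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 1 = deceptive

probe = LogisticRegression(max_iter=1000).fit(X, y)
# Per-example probe score: probability the activation came from a
# deceptive response. Token-level scores work the same way, one
# activation vector per token.
scores = probe.predict_proba(X)[:, 1]
print(f"train accuracy: {probe.score(X, y):.2f}")
```

A white-box monitor would consume these per-token probe scores alongside the model's output; a black-box monitor sees the output alone.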
📝 Abstract
AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it is unclear how effective these probes are at detecting deception in practice, or whether they are resistant to simple counter-strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white-box monitor outperforms the black-box monitor, i.e., the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
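The black-to-white performance boost described above can be sketched as a simple difference in detection quality (here AUROC) between the two monitors. All scores below are synthetic stand-ins: the white-box monitor is simply modeled as better correlated with the true label than the black-box one, which is an assumption for illustration, not a result.

```python
# Sketch of the black-to-white performance boost: the gain in AUROC when
# the monitor also sees probe activations (white-box) versus only the
# model's output (black-box). Scores are synthetic, not real monitors.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, size=n)  # 1 = deceptive response

# Hypothetical monitor scores: both correlate with the label, but the
# black-box monitor is noisier (larger scale).
black_box_scores = labels + rng.normal(scale=2.0, size=n)
white_box_scores = labels + rng.normal(scale=1.0, size=n)

boost = (roc_auc_score(labels, white_box_scores)
         - roc_auc_score(labels, black_box_scores))
print(f"black-to-white performance boost: {boost:+.3f}")
```

A positive boost indicates that access to token-level probe activations improved detection over output-only monitoring, which is the quantity the paper uses to benchmark probes.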