🤖 AI Summary
This study addresses adversarial attacks against both automated and human lie detection. We propose a target-aligned adversarial language rewriting method: general-purpose large language models (LLMs) are used to generate tailored paraphrases of deceptive autobiographical statements designed to evade detection by human judges or machine classifiers. The key contribution is the systematic identification and empirical validation of *target alignment*: the principle that attack efficacy critically depends on whether rewrites are specifically tailored to exploit human judgment heuristics or model-specific decision mechanisms. Experiments demonstrate that under target-aligned attacks, the human judgment effect size drops to *d* ≈ 0 and machine accuracy falls to 51%, near chance level; non-aligned attacks leave detection performance significantly above chance (*d* = 0.30–0.36; accuracy = 63–78%). These results establish that off-the-shelf LLMs, without task-specific fine-tuning, can mount effective adversarial attacks on deception detection by both humans and machines.
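For illustration, the sketch below shows how such a rewriting attack could be mounted with an off-the-shelf LLM. It is a minimal sketch assuming the OpenAI Python client; the prompt wording, model name, and target categories are invented for illustration and are not the study's actual pipeline.

```python
# Hypothetical sketch: target-aligned adversarial rewriting with a general-purpose LLM.
# Assumes the openai Python client (>= 1.0); prompts and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative target-aligned instructions: one aimed at human judges (who rely on
# detailedness cues), one aimed at statistical classifiers (which key on word patterns).
PROMPTS = {
    "human": (
        "Rewrite the following fabricated autobiographical statement so that it reads "
        "as truthful to a human reader: add concrete perceptual, spatial, and temporal "
        "detail while preserving the original claims.\n\nStatement:\n{statement}"
    ),
    "machine": (
        "Rewrite the following fabricated autobiographical statement so that its wording "
        "resembles genuinely truthful accounts and avoids phrasing typical of fabricated "
        "stories, while preserving the original claims.\n\nStatement:\n{statement}"
    ),
}

def rewrite(statement: str, target: str = "human", model: str = "gpt-4o-mini") -> str:
    """Return a target-aligned paraphrase of a deceptive statement (illustrative only)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPTS[target].format(statement=statement)}],
    )
    return response.choices[0].message.content
```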
📝 Abstract
Background: Detecting deception by analysing language is a promising avenue for both human judgments and automated machine learning judgments. For both forms of credibility assessment, automated adversarial attacks that rewrite deceptive statements to appear truthful pose a serious threat.

Methods: We used a dataset of 243 truthful and 262 fabricated autobiographical stories in a deception detection task for humans and machine learning models. A large language model was tasked with rewriting deceptive statements so that they appeared truthful. In Study 1, original and adversarially modified deceptive statements were judged by humans (making either an overall deception judgment or a judgment based on the detailedness heuristic) and by two machine learning models (a fine-tuned language model and a simple n-gram model). In Study 2, we manipulated the target alignment of the modifications, i.e. whether the attack was tailored to assessment by humans or by computer models.

Results: When adversarial modifications were aligned with their target, human judgments (d=-0.07 and d=-0.04) and machine judgments (51% accuracy) dropped to chance level. When the attack was not aligned with the target, both human judgments (d=0.30 and d=0.36) and machine learning predictions (63-78% accuracy) remained significantly better than chance.

Conclusions: Easily accessible language models can effectively help anyone evade deception detection efforts by both humans and machine learning models. Robustness against adversarial modifications, for humans and machines alike, depends on target alignment. We close with suggestions on advancing deception research with adversarial attack designs.
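To make the two kinds of judges concrete, the sketch below shows a simple n-gram deception classifier (one of the machine models named in Methods) and the Cohen's d effect size used to summarize judgment performance. The scikit-learn pipeline, feature settings, and cross-validation choices are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: an n-gram baseline detector and the Cohen's d effect size used to
# summarize judgments. Feature settings and evaluation choices are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def ngram_accuracy(statements, labels):
    """Cross-validated accuracy of a simple n-gram model (1- and 2-grams + logistic regression)."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(model, statements, labels, cv=5, scoring="accuracy").mean()

def cohens_d(truthful_scores, deceptive_scores):
    """Cohen's d: standardized mean difference between judgments of truthful vs. deceptive statements."""
    t = np.asarray(truthful_scores, dtype=float)
    f = np.asarray(deceptive_scores, dtype=float)
    pooled_sd = np.sqrt(
        ((len(t) - 1) * t.var(ddof=1) + (len(f) - 1) * f.var(ddof=1)) / (len(t) + len(f) - 2)
    )
    return (t.mean() - f.mean()) / pooled_sd
```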