🤖 AI Summary
This work addresses a limitation of existing detector-evasion methods: they struggle to evade AI-text detectors while preserving semantic fidelity, often degrading fine-grained semantics or relying on unstable, weight-sensitive trade-offs. The authors formulate detector-evasive paraphrasing as a constrained Markov decision process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. They propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a GRPO-style group-based policy update that adaptively balances these competing objectives. Experiments show that DEPO achieves strong detector evasion while satisfying the semantic preservation constraint, and that it remains robust across domains, detectors, and prompt settings.
📝 Abstract
AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
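To make the primal-dual idea concrete, the following is a minimal sketch of how a Lagrangian reward, a group-relative advantage, and a multiplier update could fit together. It is not the authors' implementation: the function names, the threshold `tau`, the step size `lr`, the mean/standard-deviation normalization, and the example scores are all assumptions introduced for illustration, and the abstract does not specify the exact reward or update rule DEPO uses.

```python
from statistics import mean, pstdev


def lagrangian_group_advantages(evasion, similarity, lam, tau, eps=1e-8):
    """Group-relative advantages for paraphrases sampled from one prompt.

    All arguments are hypothetical: `evasion` holds detector-evasion scores,
    `similarity` holds semantic-similarity scores, `lam` is the Lagrange
    multiplier, and `tau` is the assumed similarity threshold.
    """
    # Lagrangian reward: evasion score plus multiplier-weighted constraint slack.
    rewards = [e + lam * (s - tau) for e, s in zip(evasion, similarity)]
    # GRPO-style normalization: centre and scale within the sampled group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def update_multiplier(lam, similarity, tau, lr=0.1):
    """Projected dual step: raise lam when mean similarity falls below tau."""
    violation = tau - mean(similarity)
    return max(0.0, lam + lr * violation)


# Made-up scores for a group of four candidate paraphrases.
evasion = [0.9, 0.4, 0.7, 0.2]
similarity = [0.70, 0.95, 0.85, 0.98]
lam, tau = 1.0, 0.9

advantages = lagrangian_group_advantages(evasion, similarity, lam, tau)
new_lam = update_multiplier(lam, similarity, tau)
```

In this sketch the multiplier grows whenever the group's average similarity drops below the threshold, which shifts weight toward semantic preservation in the next policy update, and it shrinks toward zero once the constraint is comfortably met. That adaptive weighting is what distinguishes a constrained formulation from a fixed scalarized reward.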