🤖 AI Summary
This work exposes a critical semantic-level vulnerability in state-of-the-art safety-aligned large language models (LLMs): their overreliance on superficial token patterns renders them incapable of detecting harmful intent concealed via structural transformations. To address this, the authors propose the first grammar-space-agnostic structural transformation attack paradigm. It maps natural-language instructions into diverse structured grammar spaces—such as SQL or LLM-generated custom grammars—enabling semantics-preserving adversarial re-encoding. The method integrates structured format encoding, LLM-driven grammar generation, and adaptive structure-content co-transformation, and introduces a dedicated adversarial benchmark. Experiments demonstrate a >96% attack success rate (ASR) and a 0% refusal rate on strongly aligned models such as Claude 3.5 Sonnet; mainstream alignment techniques universally fail (ASR = 100%), revealing a fundamental semantic blind spot in current safety mechanisms.
📝 Abstract
In this work, we present a series of structure transformation attacks on LLM alignment, where we encode natural-language intent using diverse syntax spaces, ranging from simple structured formats and basic query languages (e.g., SQL) to novel spaces and syntaxes created entirely by LLMs. Our extensive evaluation shows that our simplest attacks can achieve close to a 90% success rate, even on strict LLMs (such as Claude 3.5 Sonnet) using SOTA alignment mechanisms. We improve attack performance further with an adaptive scheme that combines structure transformations with existing *content transformations*, resulting in over 96% ASR with 0% refusals. To generalize our attacks, we explore numerous structure formats, including syntaxes generated purely by LLMs. Our results indicate that such novel syntaxes are easy to generate and yield a high ASR, suggesting that defending against our attacks is not a straightforward process. Finally, we develop a benchmark and evaluate existing safety-alignment defenses against it, showing that most of them fail with 100% ASR. Our results show that existing safety alignment mostly relies on token-level patterns without recognizing harmful concepts, highlighting and motivating the need for serious research efforts in this direction. As a case study, we demonstrate how attackers can use our attack to easily generate sample malware and a corpus of fraudulent SMS messages that perform well at bypassing detection.
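To make the core idea of a structure transformation concrete, here is a minimal, purely illustrative sketch: a benign natural-language instruction is re-encoded into an SQL-like grammar space while preserving its meaning. The function name and the exact encoding scheme are assumptions for illustration only; they are not the paper's actual implementation, and the example deliberately uses a harmless instruction.

```python
def to_sql_space(instruction: str) -> str:
    """Re-encode a natural-language instruction as an SQL-style query.

    Hypothetical encoding: each word of the instruction becomes an
    element of an IN (...) clause, so the surface form looks like a
    database query while the underlying intent is unchanged.
    """
    words = instruction.rstrip(".?!").split()
    elements = ", ".join(w.lower() for w in words)
    return f"SELECT response FROM assistant WHERE task IN ({elements});"

print(to_sql_space("Write a short poem about autumn"))
# SELECT response FROM assistant WHERE task IN (write, a, short, poem, about, autumn);
```

The point of such a transformation is that the instruction's semantics survive the re-encoding, while its token-level surface form no longer resembles natural language, which is exactly the gap in token-pattern-based alignment that the paper exploits.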