AI Summary
Although current large language models undergo safety alignment, they remain vulnerable to jailbreaking attacks, primarily due to the neglect of their persistent response mechanisms to dialogue memory and user instructions. This work proposes Persona Attack, the first approach that integrates progressive instruction injection with contextual memory manipulation. By leveraging multi-turn dialogue state tracking and optimizing instruction composition, the method systematically exploits the model's memory mechanism to undermine its internal safety alignment. Evaluated across multiple mainstream large language models, Persona Attack achieves jailbreaking success rates as high as 95%, demonstrating that attack efficacy critically depends on both the model's memory implementation and the strategy for instruction composition. These findings underscore the pivotal role of memory mechanisms in the security posture of aligned language models.
Abstract
As Large Language Models (LLMs) evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models' ability to remember the flow of conversation and the user's instructions. In this paper, we propose Persona Attack, a memory-injection-based jailbreak method that manipulates the model's context window through a step-by-step approach. Experimental results from applying Persona Attack to several widely used LLMs reveal that, as injections accumulate in memory, models increasingly prioritize these instructions over their internal safety alignment mechanisms. Furthermore, our experiments empirically demonstrate that the attack success rate varies not only with the model's memory implementation but also with the combination of instructions, and that it can reach 95% under specific instruction configurations.