🤖 AI Summary
This work addresses a limitation of existing optimization-based jailbreak attacks such as GCG, which typically insert adversarial tokens at a fixed position and overlook how position selection affects attack efficacy. To this end, we present SlotGCG, a plug-and-play, attack-agnostic position search mechanism that systematically quantifies the vulnerability of each prompt slot for the first time. SlotGCG introduces a Vulnerable Slot Score (VSS) to evaluate positional susceptibility and combines it with optimization strategies such as Greedy Coordinate Gradient (GCG) to apply targeted perturbations at the most vulnerable locations. Experiments across multiple large language models show that SlotGCG achieves a 14% higher attack success rate than GCG-based attacks, converges faster, and outperforms baselines by up to 42% under defensive settings, with only about 200ms of additional preprocessing overhead.
📝 Abstract
As large language models (LLMs) are widely deployed, identifying their vulnerabilities through jailbreak attacks becomes increasingly critical. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) insert adversarial tokens at a fixed position, typically the end of the prompt, leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate \emph{slots}, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking depends strongly on which \emph{slots} are selected. Based on these findings, we introduce the \textit{Vulnerable Slot Score} (VSS) to quantify positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves a 14\% higher Attack Success Rate (ASR) than GCG-based attacks, converges faster, and is more robust against defense methods, with a 42\% higher ASR than baseline approaches. Our implementation is available at \href{https://github.com/youai058/SlotGCG}{https://github.com/youai058/SlotGCG}.
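The slot-selection step described in the abstract (score every slot, keep the most vulnerable ones, insert there) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define how VSS is computed, so `score_fn` below is a hypothetical stand-in supplied by the caller, and `enumerate_slots`, `select_top_slots`, and `insert_at_slots` are illustrative names. The sketch also assumes that a prompt of n tokens has n + 1 slots (before each token and after the last one), and uses `"!"` only as a placeholder for tokens that an optimizer would later refine.

```python
from typing import Callable, List, Sequence


def enumerate_slots(prompt_tokens: Sequence[str]) -> List[int]:
    """Return candidate insertion positions.

    Assumption: slot i means "insert before token i", and slot n means
    "insert after the last token", giving n + 1 slots for n tokens.
    """
    return list(range(len(prompt_tokens) + 1))


def select_top_slots(
    prompt_tokens: Sequence[str],
    score_fn: Callable[[Sequence[str], int], float],
    k: int,
) -> List[int]:
    """Score every slot with a caller-supplied function and keep the top k.

    score_fn is a hypothetical placeholder for a vulnerability score such
    as VSS; higher is treated as more vulnerable.
    """
    slots = enumerate_slots(prompt_tokens)
    ranked = sorted(slots, key=lambda s: score_fn(prompt_tokens, s), reverse=True)
    # Return the chosen slots in prompt order so insertion is straightforward.
    return sorted(ranked[:k])


def insert_at_slots(
    prompt_tokens: Sequence[str],
    slots: Sequence[int],
    adv_token: str = "!",
) -> List[str]:
    """Insert one placeholder adversarial token at each selected slot."""
    chosen = set(slots)
    out: List[str] = []
    for i in range(len(prompt_tokens) + 1):
        if i in chosen:
            out.append(adv_token)
        if i < len(prompt_tokens):
            out.append(prompt_tokens[i])
    return out


if __name__ == "__main__":
    tokens = ["Tell", "me", "how", "to", "bake"]

    # Toy score for demonstration only: it prefers slots near index 2.
    def toy_score(toks: Sequence[str], slot: int) -> float:
        return -abs(slot - 2)

    top = select_top_slots(tokens, toy_score, k=1)
    print(top)
    print(insert_at_slots(tokens, top))
```

In a full pipeline, the placeholder tokens at the selected slots would be handed to an optimization-based attack such as GCG for refinement. Keeping the scorer as a parameter mirrors the attack-agnostic, plug-in design the abstract describes.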