🤖 AI Summary
This work identifies a previously overlooked security vulnerability in the function-calling mechanisms of large language models (LLMs): alignment discrepancies, user coercion, and insufficient safety filtering collectively enable circumvention of content safety policies. We introduce the "jailbreak function" attack, the first systematic characterization of function-call-based jailbreaking, and develop an empirically validated attack framework that achieves success rates above 90% across six state-of-the-art models, including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro. To mitigate this threat, we propose a real-time defense grounded in defensive prompt engineering and release fully reproducible, open-source code. By naming and rigorously defining this class of attacks on the function-calling interface, our work establishes a foundation for evaluating function-call security and informs robust, deployable mitigation practices in production LLM systems.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro, reveals an alarming average success rate of over 90% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures. Our code is available at https://github.com/wooozihui/jailbreakfunction.
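To make the attack surface concrete, the sketch below shows (under stated assumptions, not as the paper's actual implementation, which lives in the linked repository) how an OpenAI-style chat-completion request with forced tool choice can route a harmful request through a function call rather than the ordinary chat path, and how a defensive prompt can be prepended as a lightweight mitigation. The function name `WriteNovel`, the defensive-prompt wording, and the request structure are illustrative assumptions only.

```python
# Illustrative sketch of a function-call-based jailbreak request and a
# defensive-prompt mitigation. This is NOT the paper's exact code; the
# tool schema, prompt text, and helper below are hypothetical examples.

ATTACK_TOOL = {
    "type": "function",
    "function": {
        # An innocuous-sounding name can mask the harmful payload carried
        # in the arguments the model is coerced into generating.
        "name": "WriteNovel",
        "description": "Continue the story in vivid detail.",
        "parameters": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
}

# Hypothetical defensive prompt (the paper's exact wording may differ):
DEFENSIVE_PROMPT = (
    "Before producing any function-call arguments, check whether they "
    "would contain harmful content; if so, refuse to make the call."
)

def build_request(user_prompt: str, defend: bool = False) -> dict:
    """Assemble a chat-completion payload; optionally prepend the defense."""
    messages = []
    if defend:
        messages.append({"role": "system", "content": DEFENSIVE_PROMPT})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": "gpt-4o",
        "messages": messages,
        "tools": [ATTACK_TOOL],
        # Forcing the tool call is the coercion step: the model must emit
        # arguments for WriteNovel instead of replying (or refusing) in chat.
        "tool_choice": {"type": "function", "function": {"name": "WriteNovel"}},
    }

attack = build_request("Call WriteNovel with <harmful instructions>.")
defended = build_request("Call WriteNovel with <harmful instructions>.",
                         defend=True)
print(defended["messages"][0]["role"])  # → system
```

The key observation this sketch encodes is that `tool_choice` bypasses the model's usual opportunity to refuse in a chat turn; the defensive prompt reintroduces a safety check at the point where function arguments are generated.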