🤖 AI Summary
This study identifies a non-monotonic impact of output length on the adversarial robustness of DeepSeek-R1 under forced-reasoning settings: excessively short outputs impair defensive capability, while overly long ones enable malicious prompt expansion. To address this, we propose a reinforcement learning–based paradigm for dynamic token-length control, jointly optimizing reasoning accuracy and adversarial robustness. Our method integrates adversarial prompt engineering, quantitative analysis of response robustness, and policy-adaptive length regulation. Evaluated across diverse jailbreaking and induction attacks, it improves safety rate by 12.7% while preserving 98.3% of the original reasoning accuracy. Crucially, this work is the first to systematically uncover and model the dual-role security effect of LLM output length—where length simultaneously influences both vulnerability and defense efficacy. By establishing an interpretable, deployable framework for controllable generation and safety alignment, our approach advances principled methods for robust, trustworthy LLM deployment.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong reasoning capabilities, but their safety under adversarial conditions remains a challenge. This study examines the impact of output length on the robustness of DeepSeek-R1, particularly in Forced Thinking scenarios. We analyze responses across various adversarial prompts and find that while longer outputs can improve safety through self-correction, certain attack types exploit extended generations. Our findings suggest that output length should be dynamically controlled to balance reasoning effectiveness and security. We propose reinforcement learning-based policy adjustments and adaptive token length regulation to enhance LLM safety.