TFL: Targeted Bit-Flip Attack on Large Language Model

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bit-flip attacks struggle to precisely control the outputs of large language models (LLMs) on specific prompts, typically causing only non-targeted performance degradation. This work proposes TFL, a novel framework that achieves the first high-precision, low-perturbation targeted bit-flip attack against LLMs. By integrating keyword-focused attack objectives with a utility scoring mechanism that balances effectiveness and stealth, TFL optimizes both the location and number of bit flips. With fewer than 50 flipped bits, TFL successfully steers the outputs of models such as Qwen, DeepSeek, and Llama on targeted tasks—including DROP, GSM8K, and TriviaQA—while exerting minimal impact on unrelated inputs. This approach significantly outperforms existing methods in both accuracy and specificity.

📝 Abstract
Large language models (LLMs) are increasingly deployed in safety- and security-critical applications, raising concerns about their robustness to model-parameter fault-injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit vulnerabilities in computer main memory (i.e., DRAM) to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFAs on LLMs largely induce untargeted failures or general performance degradation, offering limited control for manipulating specific, targeted outputs. In this paper, we present TFL, a novel targeted bit-flip attack framework that enables precise manipulation of LLM outputs for selected prompts while causing little or no degradation on unrelated inputs. Within the TFL framework, we propose a novel keyword-focused attack loss that promotes attacker-specified target tokens in generative outputs, together with an auxiliary utility score that balances attack effectiveness against collateral performance impact on benign data. We evaluate TFL on multiple LLMs (Qwen, DeepSeek, Llama) and benchmarks (DROP, GSM8K, and TriviaQA). The experiments show that TFL achieves successful targeted manipulation of LLM outputs with fewer than 50 bit flips and a significantly reduced effect on unrelated queries compared to prior BFA approaches. This demonstrates the effectiveness of TFL and positions it as a new class of stealthy, targeted attacks on LLMs.
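The abstract does not give TFL's formulas, but the two core ideas can be illustrated with a minimal sketch: a single bit flip in a float16 weight's exponent can change its magnitude by orders of magnitude (which is why so few flips suffice), and a utility score can trade attack gain against collateral damage on benign data. The `utility_score` function and its `lam` weight below are hypothetical illustrations, not TFL's actual scoring mechanism.

```python
import numpy as np

def flip_bit(weight: np.float16, bit: int) -> np.float16:
    """Flip one bit (0 = mantissa LSB ... 15 = sign) of a float16 weight."""
    raw = np.array([weight], dtype=np.float16).view(np.uint16)
    raw ^= np.uint16(1 << bit)  # XOR toggles exactly the chosen bit
    return raw.view(np.float16)[0]

# Flipping the top exponent bit of an ordinary weight explodes its magnitude:
w = np.float16(0.5)       # bit pattern 0x3800
print(flip_bit(w, 14))    # 32768.0 -- a huge change from a single bit

# Hypothetical utility score in the spirit of the abstract: reward the loss
# drop on the attacker's targeted prompts, penalize the collateral loss rise
# on benign data. Both loss terms and lam are illustrative assumptions.
def utility_score(target_loss_drop: float, benign_loss_rise: float,
                  lam: float = 1.0) -> float:
    return target_loss_drop - lam * benign_loss_rise
```

A search over candidate bit positions would then keep only flips whose utility score stays positive, balancing effectiveness against stealth as the abstract describes.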
Problem

Research questions and friction points this paper is trying to address.

bit-flip attack
large language models
targeted attack
model robustness
parameter fault injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

targeted bit-flip attack
large language models
model parameter fault injection
keyword-focused loss
stealthy model manipulation