TFL: Targeted Bit-Flip Attack on Large Language Model

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bit-flip attacks struggle to precisely control the outputs of large language models (LLMs) on specific prompts, typically causing only non-targeted performance degradation. This work proposes TFL, a novel framework that achieves the first high-precision, low-perturbation targeted bit-flip attack against LLMs. By integrating keyword-focused attack objectives with a utility scoring mechanism that balances effectiveness and stealth, TFL optimizes both the location and number of bit flips. With fewer than 50 flipped bits, TFL successfully steers the outputs of models such as Qwen, DeepSeek, and Llama on targeted tasks—including DROP, GSM8K, and TriviaQA—while exerting minimal impact on unrelated inputs. This approach significantly outperforms existing methods in both accuracy and specificity.

📝 Abstract
Large language models (LLMs) are increasingly deployed in safety- and security-critical applications, raising concerns about their robustness to model-parameter fault-injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit vulnerabilities in computer main memory (i.e., DRAM) to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFAs on LLMs largely induce untargeted failures or general performance degradation, offering limited control for manipulating specific, targeted outputs. In this paper, we present TFL, a novel targeted bit-flip attack framework that enables precise manipulation of LLM outputs for selected prompts while causing little or no degradation on unrelated inputs. Within the TFL framework, we propose a novel keyword-focused attack loss that promotes attacker-specified target tokens in generative outputs, together with an auxiliary utility score that balances attack effectiveness against collateral performance impact on benign data. We evaluate TFL on multiple LLMs (Qwen, DeepSeek, Llama) and benchmarks (DROP, GSM8K, and TriviaQA). The experiments show that TFL achieves successful targeted manipulation of LLM outputs with fewer than 50 bit flips and a significantly reduced effect on unrelated queries compared to prior BFA approaches. This demonstrates the effectiveness of TFL and positions it as a new class of stealthy, targeted attacks on LLMs.
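The abstract does not give TFL's formulas, but the two core ideas can be illustrated with a minimal sketch: a single bit flip in a float16 weight's exponent can change its magnitude by orders of magnitude (which is why so few flips suffice), and a utility score can trade attack gain against collateral damage on benign data. The `utility_score` function and its `lam` weight below are hypothetical illustrations, not TFL's actual scoring mechanism.

```python
import numpy as np

def flip_bit(weight: np.float16, bit: int) -> np.float16:
    """Flip one bit (0 = mantissa LSB ... 15 = sign) of a float16 weight."""
    raw = np.array([weight], dtype=np.float16).view(np.uint16)
    raw ^= np.uint16(1 << bit)  # XOR toggles exactly the chosen bit
    return raw.view(np.float16)[0]

# Flipping the top exponent bit of an ordinary weight explodes its magnitude:
w = np.float16(0.5)       # bit pattern 0x3800
print(flip_bit(w, 14))    # 32768.0 -- a huge change from a single bit

# Hypothetical utility score in the spirit of the abstract: reward the loss
# drop on the attacker's targeted prompts, penalize the collateral loss rise
# on benign data. Both loss terms and lam are illustrative assumptions.
def utility_score(target_loss_drop: float, benign_loss_rise: float,
                  lam: float = 1.0) -> float:
    return target_loss_drop - lam * benign_loss_rise
```

A search over candidate bit positions would then keep only flips whose utility score stays positive, balancing effectiveness against stealth as the abstract describes.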
Problem

Research questions and friction points this paper is trying to address.

bit-flip attack
large language models
targeted attack
model robustness
parameter fault injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

targeted bit-flip attack
large language models
model parameter fault injection
keyword-focused loss
stealthy model manipulation