Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) are increasingly deployed in high-performance computing environments, yet how soft errors propagate during inference, and what impact they have, remains poorly understood. This work proposes LLMFI, a configurable and deterministic fault injection framework, and conducts fine-grained error injection experiments across three open-weight LLMs and thirteen representative tasks spanning reasoning, multilingual processing, mathematical problem solving, and code generation. The study uncovers critical vulnerability patterns, distills seventeen core findings, and formulates four low-overhead, software-only strategies for improving reliability. These contributions provide empirical foundations and practical guidance for designing fault-tolerant LLM systems.

📝 Abstract

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through uses such as code generation and domain-specific decision-making. Yet how soft errors propagate through and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study of error propagation in LLM inference, enabled by LLMFI, our configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weight LLMs and thirteen representative tasks covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions for improving reliability through software-only modifications, offering practical guidance for future error detection and mitigation.
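
To make the idea of a deterministic soft-error injection concrete, the sketch below flips a single bit in the IEEE 754 float32 encoding of one value, with the target chosen by a seeded random generator so the same seed always reproduces the same fault. This is an illustration only and is not LLMFI's interface: the names `flip_bit` and `inject_fault`, the single-bit-flip fault model, and the use of a plain Python list in place of a model tensor are all assumptions made for the example.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the float32 encoding of value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles the chosen bit, then the bytes are read back as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def inject_fault(values: list[float], seed: int) -> tuple[list[float], int, int]:
    """Hypothetical helper: flip one seeded-random bit in one element.

    Returns the faulty copy plus the chosen (index, bit) so the fault
    can be logged and replayed.
    """
    rng = random.Random(seed)  # Local generator, so the seed fully fixes the fault.
    index = rng.randrange(len(values))
    bit = rng.randrange(32)
    faulty = list(values)
    faulty[index] = flip_bit(values[index], bit)
    return faulty, index, bit


if __name__ == "__main__":
    print(flip_bit(1.0, 31))  # Sign bit: 1.0 becomes -1.0.
    print(flip_bit(1.0, 30))  # Top exponent bit: 1.0 becomes inf.
    print(flip_bit(1.0, 0))   # Lowest mantissa bit: barely changes the value.
    print(inject_fault([0.5, 1.0, 2.0], seed=7))
```

The three single-value calls hint at why bit position matters: a flip in the sign or exponent field can change a value drastically, while a flip in the low mantissa bits is nearly invisible.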

Problem

Research questions and friction points this paper is trying to address.

error propagation

large language models

soft errors

LLM inference

fault tolerance

Innovation

Methods, ideas, or system contributions that make the work stand out.

error propagation

large language models

fault injection