🤖 AI Summary
Large language models (LLMs) are increasingly deployed in high-performance computing environments, yet how soft errors propagate during inference, and what impact they have, remains poorly understood. This work proposes LLMFI, a configurable and deterministic fault-injection framework, and uses it to conduct fine-grained fault-injection experiments across three open-weight LLMs and thirteen representative tasks spanning reasoning, multilingual processing, mathematical problem solving, and code generation. The study uncovers critical vulnerability patterns, distills seventeen takeaways, and outlines four low-overhead, software-only directions for improving reliability. These contributions provide empirical foundations and practical guidance for designing fault-tolerant LLM systems.
📝 Abstract
Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through applications such as code generation and domain-specific decision-making. Yet how soft errors propagate through and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study of error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weight LLMs and thirteen representative tasks covering the reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions for improving reliability through software-only modifications, offering practical guidance for future error detection and mitigation.