LLM-42: Enabling Determinism in LLM Inference with Verified Speculation

📅 2026-01-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large language model inference lacks output determinism due to floating-point non-associativity, dynamic batching, and GPU reduction orders that vary with batch size. This work proposes a scheduling-based speculative validation mechanism, the first to introduce speculative execution into deterministic inference. By combining lightweight validate-and-rollback cycles with fixed-shape reduction scheduling, the approach incurs overhead only for requests that require determinism, while remaining compatible with dynamic batching and needing only minimal modification to existing GPU kernels. The method decouples determinism guarantees from low-level implementation details, achieving high throughput and significantly outperforming baseline strategies such as disabling dynamic batching or rewriting kernels to be batch-invariant.
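
To make the root cause concrete, here is a minimal NumPy sketch (an illustration, not from the paper) of why reduction order matters: the same float32 values summed sequentially versus pairwise, two orders a dynamically batched kernel might take, typically disagree in the low-order bits.

    import numpy as np

    # Toy demonstration (not from the paper): identical float32 inputs,
    # two reduction orders, slightly different results. This is the
    # floating-point non-associativity that dynamic batching exposes
    # when kernel reduction order changes with batch size.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096).astype(np.float32)

    seq = np.float32(0.0)
    for v in x:                    # one order: strictly sequential
        seq = seq + v

    pairwise = x.reshape(-1, 2).sum(axis=1).sum()  # another order: pairwise tree

    print(seq, pairwise, bool(seq == pairwise))    # typically prints ... False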

📝 Abstract
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly reuses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
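
As a concrete illustration of the verify-rollback loop, the following self-contained Python sketch (all names hypothetical, not the paper's API) substitutes a toy token function for the model: the fast path occasionally diverges, the verifier replays under a deterministic reference, commits the matching prefix, and rolls back at the first mismatch, so committed output is identical across runs.

    import random

    # Toy sketch of speculate -> verify -> commit/rollback. A simple hash
    # stands in for the LLM; "fixed-shape replay" is the deterministic
    # reference, the fast path mimics batch-dependent divergence.

    def deterministic_next(prefix):
        # Stand-in for a token decoded under the fixed-shape reduction schedule.
        return (sum(prefix) * 31 + len(prefix)) % 100

    def decode_fast(prefix, n, flip_prob=0.1):
        # Non-deterministic fast path: usually agrees with the deterministic
        # result, but occasionally differs (mimicking reduction-order drift).
        out, p = [], list(prefix)
        for _ in range(n):
            t = deterministic_next(p)
            if random.random() < flip_prob:
                t = (t + 1) % 100  # divergent candidate token
            out.append(t)
            p.append(t)
        return out

    def verify_and_commit(prefix, candidates):
        # Replay candidates deterministically; commit the matching prefix and
        # roll back at the first mismatch, substituting the verified token so
        # each cycle makes at least one token of progress.
        committed, p = [], list(prefix)
        for t in candidates:
            ref = deterministic_next(p)
            if t != ref:
                committed.append(ref)  # rollback: discard the divergent tail
                break
            committed.append(t)
            p.append(t)
        return committed

    def generate(prompt, max_tokens, window=8):
        seq = list(prompt)
        while len(seq) - len(prompt) < max_tokens:
            cand = decode_fast(seq, window)         # speculative fast path
            seq.extend(verify_and_commit(seq, cand))  # deterministic replay
        return seq[len(prompt):len(prompt) + max_tokens]

    random.seed()                     # fresh randomness every run...
    print(generate([1, 2, 3], 16))    # ...yet the output never changes

Because every committed token equals the deterministic replay's token, the output sequence is invariant to the fast path's randomness; only latency varies with how often rollback fires, mirroring the paper's claim that overhead scales with the traffic that needs determinism.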
Problem

Research questions and friction points this paper is trying to address.

non-determinism
LLM inference
dynamic batching
floating-point non-associativity
GPU kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

determinism
speculative decoding
dynamic batching
verified speculation
LLM inference