π€ AI Summary
This work addresses the performance bottlenecks and low resource utilization in existing large language model (LLM) inference systems, which rely on the host CPU for orchestration and per-token control, making them susceptible to interference. The paper proposes Blink, the first end-to-end architecture that fully decouples steady-state LLM inference from the CPU. Blink offloads request handling to a SmartNIC, leverages RDMA for zero-copy direct transfers into GPU memory, and employs a persistent GPU kernel to unify batching, scheduling, and KV cache management. Compared to state-of-the-art systems such as TensorRT-LLM, vLLM, and SGLang, Blink reduces P99 time-to-first-token latency by up to 8.47Γ, cuts per-token processing time by 3.40Γ, improves decoding throughput by 2.1Γ, lowers energy consumption by 48.6%, and maintains stable performance under CPU interference.
π Abstract
Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized.
We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement.
Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 TTFT by up to 8.47$\times$ and P99 TPOT by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.