🤖 AI Summary
In multi-tenant DNN inference, co-locating high-priority latency-sensitive (LS) and low-priority best-effort (BE) workloads on shared GPUs causes severe VRAM channel conflicts and PCIe bandwidth contention, inflating LS tail latency and depressing BE throughput. The paper proposes Missile, the first software-only framework that approximates fine-grained GPU resource isolation between tenants. Missile reverse-engineers NVIDIA's proprietary VRAM channel hash mapping and applies cache coloring to eliminate VRAM channel interference, and it schedules DMA transfers with a tenant-aware completely fair PCIe scheduler that allocates per-tenant PCIe bandwidth quotas on demand. Without modifying hardware or drivers, Missile reduces LS 99th-percentile latency by up to ~50% and improves BE throughput by up to 6.1×, while providing tenant-level PCIe bandwidth isolation, generality across diverse DNN workloads, and practical deployability in production environments.
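To make the cache-coloring idea concrete, here is a minimal sketch rather than Missile's actual implementation: it assumes a made-up XOR-fold channel hash (`ChannelOf`), a 2 MiB coloring granularity, and 32 VRAM channels, and it uses the device virtual address as a stand-in for the physical address the real, reverse-engineered hash operates on. Pages carved from one large pool are binned by color, and each tenant would only be served pages from its reserved colors.

```cpp
// Minimal sketch of software VRAM channel coloring (all constants are assumptions).
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

constexpr size_t kPageSize    = 2ull << 20;  // 2 MiB coloring granularity (assumed)
constexpr int    kNumChannels = 32;          // number of VRAM channels (assumed)

// Hypothetical channel hash: XOR-fold the address bits above the page offset.
// Missile reverse-engineers the real mapping; this stand-in only shows the shape.
static int ChannelOf(uintptr_t addr) {
  uintptr_t x = addr >> 21;  // drop the in-page offset
  int c = 0;
  while (x) {
    c ^= static_cast<int>(x & (kNumChannels - 1));
    x >>= 5;  // fold 5 bits at a time (log2 of 32 channels)
  }
  return c;
}

int main() {
  // Carve one large VRAM pool and bin its pages by (assumed) channel color.
  void* pool = nullptr;
  const size_t pool_bytes = 512ull << 20;
  if (cudaMalloc(&pool, pool_bytes) != cudaSuccess) return 1;

  std::map<int, std::vector<void*>> pages_by_color;
  for (size_t off = 0; off + kPageSize <= pool_bytes; off += kPageSize) {
    void* page = static_cast<char*>(pool) + off;
    // NOTE: device virtual address used as a proxy for the physical address.
    pages_by_color[ChannelOf(reinterpret_cast<uintptr_t>(page))].push_back(page);
  }

  // A latency-sensitive tenant would only be handed pages whose color lies in
  // its reserved channel set, so its VRAM traffic never shares a channel with
  // best-effort tenants.
  for (const auto& kv : pages_by_color)
    printf("color %2d -> %zu pages\n", kv.first, kv.second.size());

  cudaFree(pool);
  return 0;
}
```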
📝 Abstract
Co-locating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Constrained by bottlenecks such as VRAM channel conflicts and PCIe bus contention, existing GPU sharing solutions cannot avoid resource conflicts among concurrently executing tasks and therefore fail to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at the software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that, compared to state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x the BE job throughput, and allocates PCIe bus bandwidth to tenants on demand for optimal performance.
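The completely fair PCIe scheduling can be illustrated with a small, self-contained sketch (again, not the paper's code): each tenant is charged "virtual bytes" equal to the bytes it has transferred divided by its weight, and the tenant with the smallest charge is served the next DMA chunk, so sustained bandwidth converges to the weight ratio. The tenant names, weights, and the 1 MiB quantum below are illustrative assumptions.

```cpp
// Minimal sketch of CFS-style PCIe bandwidth sharing (all constants are assumptions).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Tenant {
  const char* name;
  double      weight;  // bandwidth share, e.g. LS = 4, BE = 1 (assumed)
  double      vbytes;  // "virtual bytes": bytes moved divided by weight
  uint64_t    moved;   // raw bytes actually transferred
};

constexpr uint64_t kChunk = 1ull << 20;  // 1 MiB scheduling quantum (assumed)

int main() {
  std::vector<Tenant> tenants = {{"LS-tenant", 4.0, 0.0, 0},
                                 {"BE-tenant", 1.0, 0.0, 0}};

  // Both tenants always have data pending; hand out 100 quanta and observe how
  // the CFS rule (serve the smallest virtual-byte count) splits the bus.
  for (int q = 0; q < 100; ++q) {
    Tenant* next = &tenants[0];
    for (auto& t : tenants)
      if (t.vbytes < next->vbytes) next = &t;
    next->moved  += kChunk;
    next->vbytes += static_cast<double>(kChunk) / next->weight;
  }

  for (const auto& t : tenants)
    printf("%s: %llu MiB of the first 100 MiB\n", t.name,
           static_cast<unsigned long long>(t.moved >> 20));
  return 0;
}
```

With a 4:1 weight ratio, the loop hands the LS tenant roughly 80 of the first 100 quanta, which is the kind of on-demand, proportional bandwidth split the abstract describes.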