🤖 AI Summary
This work addresses the throughput limits that host-device synchronization latency and kernel scheduling overhead impose on GPU systems, which leave compute cores and copy engines underutilized. The authors propose a CUDA runtime framework for task-parallel pipelines that combines multi-stream scheduling, event chaining, work stealing, and per-stream buffers. The design keeps memory safe across concurrent in-flight jobs while shrinking both the gaps between kernel executions and the synchronization overhead. Using a graph-based execution flow, the framework achieves a 1.15–1.44× speedup and reduces scheduling overhead by 18%–54% relative to state-of-the-art CUDA graph baselines on representative real-world workloads.
📝 Abstract
Achieving peak GPU performance remains a significant challenge because system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines because of these scheduling overheads. To address these problems, we propose a CUDA runtime framework for task-parallel pipelines that minimizes synchronization overheads and the gaps between kernel executions. The proposed solution combines two innovations: (1) a multi-stream task-parallel pipeline programming model that leverages event-chaining and work-stealing mechanisms to fully utilize available hardware resources; and (2) a graph-based execution flow with per-stream buffers that ensures memory safety for multiple in-flight jobs running concurrently. Extensive evaluations on representative real-world workloads show a 1.15–1.44× speedup and an 18–54% reduction in scheduling overheads compared to state-of-the-art CUDA graph baselines.
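To make the work-stealing idea concrete, the following is a minimal host-side sketch in plain C++, not the authors' implementation. It models each stream as a queue of job ids and lets an idle stream take work from the busiest one. The names `StreamSlot`, `next_job`, and `drain` are hypothetical, the victim-selection policy (longest queue, steal from the back) is an assumption, and no CUDA calls are made; a real runtime would launch kernels or graphs where this sketch only records which stream ran which job.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical host-side model: one slot per stream.
struct StreamSlot {
    std::deque<int> pending;    // job ids queued for this stream
    std::vector<int> executed;  // job ids this stream ended up running
};

// Fetch the next job for stream `self`. The stream prefers the front of
// its own queue; if that is empty it steals from the back of the longest
// other queue (assumed policy). Returns false when no work is left.
bool next_job(std::vector<StreamSlot>& streams, std::size_t self, int& job) {
    assert(self < streams.size());
    std::deque<int>& own = streams[self].pending;
    if (!own.empty()) {
        job = own.front();
        own.pop_front();
        return true;
    }
    std::size_t victim = self;
    std::size_t longest = 0;
    for (std::size_t i = 0; i < streams.size(); ++i) {
        if (i != self && streams[i].pending.size() > longest) {
            longest = streams[i].pending.size();
            victim = i;
        }
    }
    if (victim == self) {
        return false;  // every other queue is empty too
    }
    job = streams[victim].pending.back();
    streams[victim].pending.pop_back();
    return true;
}

// Visit the streams round-robin until all queues are drained.
// Returns the total number of jobs executed.
std::size_t drain(std::vector<StreamSlot>& streams) {
    std::size_t done = 0;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t s = 0; s < streams.size(); ++s) {
            int job = 0;
            if (next_job(streams, s, job)) {
                streams[s].executed.push_back(job);
                ++done;
                progressed = true;
            }
        }
    }
    return done;
}
```

With two slots and jobs 0 to 3 all queued on stream 0, calling `drain` leaves stream 0 having run jobs 0 and 1 while stream 1 has stolen jobs 3 and 2. In a framework with per-stream buffers, a stolen job would run against the buffer of the stream that executes it, which is one plausible way to keep concurrent in-flight jobs from sharing memory.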