🤖 AI Summary
To address the need for efficient and scalable deployment of large language model (LLM) inference systems, this paper introduces the first hardware-software co-simulation platform supporting dynamic, token-level fine-grained modeling. The platform integrates a customized instruction-level simulator, an LLM workload abstraction layer, lightweight memory behavior modeling, and a plugin-enabled scheduling interface—enabling joint exploration of hardware architecture, runtime scheduling, and memory management. Validated on real-world datasets, it achieves sub-1% modeling error and enables key system optimizations, including substantial improvements in service throughput and GPU memory utilization. Its core contribution lies in overcoming the limitations of conventional coarse-grained modeling by pioneering dynamic token-level performance modeling and cross-stack co-optimization across the full inference stack.
📝 Abstract
The increasing demand for large language model (LLM) serving has necessitated significant advancements in the optimization and profiling of LLM inference systems. As these models become integral to a wide range of applications, the need for efficient and scalable serving solutions has grown exponentially. This work introduces TokenSim, a comprehensive hardware and software exploration system designed specifically for LLM inference. TokenSim is characterized by its support for extensible system optimizations, including scheduling and memory management. We validate TokenSim against systems running real-world datasets, achieving an error rate of less than 1%. Furthermore, TokenSim facilitates various insightful explorations into the performance and optimization of LLM serving systems.
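To make the idea of dynamic, token-level modeling with a pluggable scheduler concrete, here is a minimal sketch of what such a simulation loop could look like. This is an illustrative assumption, not TokenSim's actual API: all names (`Request`, `FCFSScheduler`, `simulate`, the KV-cache accounting) are hypothetical, and real token-level simulators model far more (instruction-level timing, batching effects, memory bandwidth).

```python
# Hypothetical sketch of a token-level LLM serving simulation with a
# plugin-style scheduler. Names and structure are illustrative only;
# they are NOT TokenSim's real interfaces.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int            # request id
    prompt_tokens: int  # prefill length
    output_tokens: int  # decode tokens to generate
    generated: int = 0  # decode tokens produced so far

class FCFSScheduler:
    """Minimal first-come-first-served scheduling plugin."""
    def select(self, waiting, running, free_kv_slots):
        # Admit waiting requests while their prompts fit in free KV-cache slots.
        admitted = []
        for req in waiting:
            if req.prompt_tokens <= free_kv_slots:
                admitted.append(req)
                free_kv_slots -= req.prompt_tokens
        return admitted

def simulate(requests, scheduler, kv_capacity, max_steps=10_000):
    """Advance one decode step (one token per running request) at a time.

    Returns a dict mapping request id -> step at which it completed.
    """
    waiting = list(requests)
    running, done = [], {}
    used_kv = 0
    for step in range(max_steps):
        if not waiting and not running:
            break
        # Scheduling plugin decides which waiting requests to admit.
        for req in scheduler.select(waiting, running, kv_capacity - used_kv):
            waiting.remove(req)
            used_kv += req.prompt_tokens  # prefill occupies KV cache
            running.append(req)
        # Token-level granularity: every running request decodes one token.
        for req in list(running):
            req.generated += 1
            used_kv += 1  # each decoded token appends one KV-cache entry
            if req.generated >= req.output_tokens:
                running.remove(req)
                used_kv -= req.prompt_tokens + req.generated  # free the cache
                done[req.rid] = step
    return done
```

A usage example under these assumptions: with a KV capacity of 128, a short and a long request can run concurrently, and the loop reports the decode step at which each finishes.

```python
reqs = [Request(0, prompt_tokens=8, output_tokens=4),
        Request(1, prompt_tokens=100, output_tokens=2)]
done = simulate(reqs, FCFSScheduler(), kv_capacity=128)
```

Swapping `FCFSScheduler` for another object with the same `select` signature is the plugin-style extension point the summary alludes to; per-token stepping is what enables the fine-grained throughput and memory-utilization analyses described above.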