TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the need for efficient and scalable deployment of large language model (LLM) inference systems, this paper introduces the first hardware-software co-simulation platform supporting dynamic, token-level fine-grained modeling. The platform integrates a customized instruction-level simulator, an LLM workload abstraction layer, lightweight memory behavior modeling, and a plugin-enabled scheduling interface—enabling joint exploration of hardware architecture, runtime scheduling, and memory management. Validated on real-world datasets, it achieves sub-1% modeling error and enables key system optimizations, including substantial improvements in service throughput and GPU memory utilization. Its core contribution lies in overcoming the limitations of conventional coarse-grained modeling by pioneering dynamic token-level performance modeling and cross-stack co-optimization across the full inference stack.
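The summary describes a plugin-enabled scheduling interface that lets users swap scheduling policies while the simulator advances token by token. A minimal sketch of what such an interface could look like is below; all class and function names (`Scheduler`, `FCFSScheduler`, `simulate`, the token-budget parameter) are illustrative assumptions, not TokenSim's actual API.

```python
# Illustrative sketch of a plugin-style scheduler for a token-level
# simulation loop. Names and structure are hypothetical, not TokenSim's API.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    generated_tokens: int = 0
    max_tokens: int = 16


class Scheduler(ABC):
    """Plugin interface: choose which requests run in the next token step."""

    @abstractmethod
    def select_batch(self, waiting: List[Request], budget: int) -> List[Request]:
        ...


class FCFSScheduler(Scheduler):
    """First-come-first-served policy under a per-step token budget."""

    def select_batch(self, waiting, budget):
        batch, used = [], 0
        for req in waiting:
            # Prefill consumes the whole prompt; decode consumes one token.
            cost = req.prompt_tokens if req.generated_tokens == 0 else 1
            if used + cost > budget:
                break
            batch.append(req)
            used += cost
        return batch


def simulate(requests, scheduler, budget=64):
    """Token-level loop: each iteration models one decoding step."""
    waiting = list(requests)
    steps = 0
    while waiting:
        batch = scheduler.select_batch(waiting, budget)
        if not batch:
            break  # budget too small for the head request
        for req in batch:
            req.generated_tokens += 1
            if req.generated_tokens >= req.max_tokens:
                waiting.remove(req)
        steps += 1
    return steps
```

Because the simulation loop only depends on the abstract `select_batch` method, a different policy (e.g. shortest-job-first) can be dropped in without touching the loop, which is the kind of extensibility the summary attributes to the platform.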

📝 Abstract
The increasing demand for large language model (LLM) serving has necessitated significant advancements in the optimization and profiling of LLM inference systems. As these models become integral to a wide range of applications, the need for efficient and scalable serving solutions has grown exponentially. This work introduces TokenSim, a comprehensive hardware and software exploration system designed specifically for LLM inference. TokenSim is characterized by its support for extensible system optimizations including scheduling and memory management. We validate the results against systems running real-world datasets, achieving an error rate of less than 1%. Furthermore, TokenSim facilitates various insightful explorations into the performance and optimization of LLM serving systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizing large language model inference systems
Enabling scalable and efficient LLM serving solutions
Exploring hardware and software for LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

TokenSim optimizes LLM inference systems
Supports extensible scheduling and memory management
Validated with less than 1% error rate
Feiyang Wu
Georgia Institute of Technology
Reinforcement Learning · Deep Learning
Zhuohang Bian
Beihang University
Guoyang Duan
Peking University
Tianle Xu
Peking University
Junchi Wu
Peking University
Teng Ma
Renmin University of China & Alibaba Group
Yongqiang Yao
Sensetime
Ruihao Gong
Sensetime
Youwei Zhuo
Peking University