🤖 AI Summary
To address the need for efficient and scalable deployment of large language model (LLM) inference systems, this paper introduces the first hardware-software co-simulation platform supporting dynamic, token-level fine-grained modeling. The platform integrates a customized instruction-level simulator, an LLM workload abstraction layer, lightweight memory behavior modeling, and a plugin-enabled scheduling interface—enabling joint exploration of hardware architecture, runtime scheduling, and memory management. Validated on real-world datasets, it achieves sub-1% modeling error and enables key system optimizations, including substantial improvements in service throughput and GPU memory utilization. Its core contribution lies in overcoming the limitations of conventional coarse-grained modeling by pioneering dynamic token-level performance modeling and cross-stack co-optimization across the full inference stack.
📝 Abstract
The increasing demand for large language model (LLM) serving has necessitated significant advancements in the optimization and profiling of LLM inference systems. As these models become integral to a wide range of applications, the need for efficient and scalable serving solutions has grown exponentially. This work introduces TokenSim, a comprehensive hardware and software exploration system designed specifically for LLM inference. TokenSim is characterized by its support for extensible system optimizations, including scheduling and memory management. We validate TokenSim against systems running real-world datasets, achieving an error rate of less than 1%. Furthermore, TokenSim facilitates various insightful explorations into the performance and optimization of LLM serving systems.
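To make the idea of dynamic, token-level modeling with a pluggable scheduler concrete, here is a minimal sketch of what such a simulation loop could look like. This is an illustrative assumption, not TokenSim's actual API: all names (`Request`, `FCFSScheduler`, `simulate`, the KV-cache accounting) are hypothetical, and real token-level simulators model far more (instruction-level timing, batching effects, memory bandwidth).

```python
# Hypothetical sketch of a token-level LLM serving simulation with a
# plugin-style scheduler. Names and structure are illustrative only;
# they are NOT TokenSim's real interfaces.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int            # request id
    prompt_tokens: int  # prefill length
    output_tokens: int  # decode tokens to generate
    generated: int = 0  # decode tokens produced so far

class FCFSScheduler:
    """Minimal first-come-first-served scheduling plugin."""
    def select(self, waiting, running, free_kv_slots):
        # Admit waiting requests while their prompts fit in free KV-cache slots.
        admitted = []
        for req in waiting:
            if req.prompt_tokens <= free_kv_slots:
                admitted.append(req)
                free_kv_slots -= req.prompt_tokens
        return admitted

def simulate(requests, scheduler, kv_capacity, max_steps=10_000):
    """Advance one decode step (one token per running request) at a time.

    Returns a dict mapping request id -> step at which it completed.
    """
    waiting = list(requests)
    running, done = [], {}
    used_kv = 0
    for step in range(max_steps):
        if not waiting and not running:
            break
        # Scheduling plugin decides which waiting requests to admit.
        for req in scheduler.select(waiting, running, kv_capacity - used_kv):
            waiting.remove(req)
            used_kv += req.prompt_tokens  # prefill occupies KV cache
            running.append(req)
        # Token-level granularity: every running request decodes one token.
        for req in list(running):
            req.generated += 1
            used_kv += 1  # each decoded token appends one KV-cache entry
            if req.generated >= req.output_tokens:
                running.remove(req)
                used_kv -= req.prompt_tokens + req.generated  # free the cache
                done[req.rid] = step
    return done
```

A usage example under these assumptions: with a KV capacity of 128, a short and a long request can run concurrently, and the loop reports the decode step at which each finishes.

```python
reqs = [Request(0, prompt_tokens=8, output_tokens=4),
        Request(1, prompt_tokens=100, output_tokens=2)]
done = simulate(reqs, FCFSScheduler(), kv_capacity=128)
```

Swapping `FCFSScheduler` for another object with the same `select` signature is the plugin-style extension point the summary alludes to; per-token stepping is what enables the fine-grained throughput and memory-utilization analyses described above.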