PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

📅 2026-02-12
🤖 AI Summary
This work addresses the inflated time-to-first-token (TTFT) of reasoning-oriented large language models, whose extended chain-of-thought (CoT) reasoning phase delays user-visible output. Existing serving frameworks do not distinguish the reasoning phase from the answering phase, so performance degrades under GPU memory constraints. The authors propose PASCAL, a phase-aware hierarchical scheduler that treats the two phases separately: it prioritizes requests in the reasoning phase to minimize TTFT, and applies controlled preemption combined with token pacing in the answering phase to preserve Quality-of-Experience (QoE). The scheduler coordinates instance-level placement with intra-instance execution and migrates requests dynamically at phase boundaries to balance load and reduce interference. Evaluated on DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering-phase SLO attainment.

📝 Abstract
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
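
The phase-priority idea in the abstract can be illustrated with a minimal sketch. This is not PASCAL's actual algorithm: the `Request` fields, the `kv_blocks` memory unit, and the greedy admission rule are all assumptions made for illustration, and token pacing and cross-instance migration are not modeled. It only shows the ordering principle, where reasoning-phase requests are admitted before answering-phase ones when GPU memory is limited, and answering-phase requests that do not fit are the ones preempted.

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    REASONING = 0  # hidden chain-of-thought tokens; nothing is visible to the user yet
    ANSWERING = 1  # user-visible tokens


@dataclass
class Request:
    req_id: str
    phase: Phase
    kv_blocks: int  # hypothetical memory cost, in KV-cache blocks


def select_batch(requests, free_blocks):
    """Greedily admit requests in phase-priority order (illustrative only).

    Reasoning-phase requests are considered first. Python's sort is stable,
    so arrival order is kept within each phase. Requests that do not fit in
    the remaining memory are returned as preempted.
    """
    running, preempted = [], []
    for req in sorted(requests, key=lambda r: r.phase):
        if req.kv_blocks <= free_blocks:
            running.append(req)
            free_blocks -= req.kv_blocks
        else:
            preempted.append(req)
    return running, preempted


# Example: four requests competing for 8 free KV-cache blocks.
queue = [
    Request("a", Phase.ANSWERING, 4),
    Request("b", Phase.REASONING, 3),
    Request("c", Phase.ANSWERING, 2),
    Request("d", Phase.REASONING, 3),
]
running, preempted = select_batch(queue, free_blocks=8)
print([r.req_id for r in running])    # ['b', 'd', 'c']
print([r.req_id for r in preempted])  # ['a']
```

In this toy run both reasoning-phase requests are admitted first, the small answering-phase request fills the remaining memory, and the large answering-phase request waits. A real scheduler would additionally bound how long an answering-phase request may stay preempted, which is where the controlled preemption and token pacing described in the abstract come in.
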
Problem

Research questions and friction points this paper is trying to address.

reasoning-based LLMs
Chain-of-Thought
Time-To-First-Token
LLM serving
phase-aware scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

phase-aware scheduling
Chain-of-Thought inference
Time-To-First-Token (TTFT)
controlled preemption
hierarchical scheduler