🤖 AI Summary
This work tackles the inflated time-to-first-token (TTFT) of reasoning-oriented large language models: existing serving schedulers do not distinguish the reasoning phase from the response phase of chain-of-thought (CoT) inference, and therefore struggle to balance latency and quality of service under GPU memory constraints. The authors propose a phase-aware hierarchical scheduling framework that, for the first time, explicitly separates the two phases: requests in the reasoning phase are prioritized to minimize TTFT, while the response phase uses controlled preemption combined with token pacing to meet service-level objectives (SLOs). The framework also supports dynamic migration at phase boundaries through coordinated instance-level placement and intra-instance execution, balancing load and mitigating interference. Evaluated on DeepSeek-R1-Distill-Qwen-32B, the approach reduces tail TTFT by up to 72% while maintaining high SLO compliance.
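The core scheduling idea, prioritizing reasoning-phase requests over answering-phase ones, can be sketched as a simple two-level priority queue. This is a minimal illustration of the concept only: the class and function names (`PhaseAwareScheduler`, `submit`, `next_request`) are hypothetical and not the paper's actual interface, and real serving systems would additionally track GPU memory and preemption state.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative phase labels: a lower value means higher scheduling priority.
REASONING, ANSWERING = 0, 1

@dataclass(order=True)
class _Entry:
    priority: int                       # phase: reasoning before answering
    seq: int                            # FIFO tie-break within a phase
    request_id: str = field(compare=False)

class PhaseAwareScheduler:
    """Sketch: admit reasoning-phase requests ahead of answering-phase ones."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, request_id, phase):
        heapq.heappush(self._heap, _Entry(phase, next(self._seq), request_id))

    def next_request(self):
        return heapq.heappop(self._heap).request_id if self._heap else None

sched = PhaseAwareScheduler()
sched.submit("a", ANSWERING)
sched.submit("b", REASONING)
sched.submit("c", ANSWERING)
print(sched.next_request())  # "b": the reasoning-phase request is served first
```

Within a phase the queue stays FIFO, so answering-phase requests are delayed but not starved; a production scheduler would bound that delay with the token-pacing SLO described above.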
📝 Abstract
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering-phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
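Token pacing during the answering phase can be illustrated with a small deadline calculation: release answer tokens only as fast as the SLO requires, so that slack GPU time can serve reasoning-phase requests. The function names and the time-per-output-token (TPOT) SLO value below are illustrative assumptions, not PASCAL's actual API.

```python
def pacing_schedule(num_tokens, tpot_slo_ms, start_ms=0):
    """Release deadline for each answer token under a per-token latency SLO."""
    return [start_ms + i * tpot_slo_ms for i in range(1, num_tokens + 1)]

def slack_ms(ready_ms, schedule_ms):
    """Per-token slack: a positive value means the token was generated early,
    so its release can be delayed and the GPU yielded to reasoning requests;
    a negative value is an SLO miss."""
    return [due - ready for ready, due in zip(ready_ms, schedule_ms)]

deadlines = pacing_schedule(3, tpot_slo_ms=50)   # deadlines at 50, 100, 150 ms
print(slack_ms([20, 70, 160], deadlines))        # [30, 30, -10]: third token misses
```

A pacing-aware scheduler would use the positive slack as a budget for preempting or delaying answering-phase decode steps, preempting only while no token's deadline would be violated.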