SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high inference cost of large language models (LLMs) in practical deployment, where balancing accuracy and efficiency remains challenging, this paper proposes a dynamic mode-routing framework grounded in input problem complexity. The framework automatically selects, per input, between a high-cost "thinking" mode (e.g., chain-of-thought reasoning) and a low-cost "non-thinking" mode (direct generation). The authors introduce a dual-state LLM architecture with a lightweight routing model, and propose the AIT (Accuracy-Inference-Token) index to holistically quantify the trade-offs among accuracy, latency, and token cost. Evaluated on multiple medical question-answering benchmarks, the method achieves an accuracy of 0.8390, surpassing the full thinking-mode baseline, while reducing inference latency by 36.8% and token consumption by 39.66%. These gains significantly enhance cost-efficiency and user experience without compromising reliability.

📝 Abstract
With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
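The routing idea in the abstract (send easy queries to the non-thinking mode, hard ones to the thinking mode) can be sketched as below. The paper's actual routing model and features are not described on this page, so the classifier (`toy_complexity`) and the 0.5 threshold are purely illustrative stand-ins.

```python
# Hypothetical sketch of complexity-based mode routing in the spirit of
# SynapseRoute. The real framework trains a lightweight ML router; here a
# toy heuristic stands in for it.

def route_query(query: str, complexity_score) -> str:
    """Return 'thinking' (chain-of-thought) or 'non-thinking' (direct)."""
    # complexity_score is a stand-in for the paper's lightweight routing model
    return "thinking" if complexity_score(query) >= 0.5 else "non-thinking"

def toy_complexity(query: str) -> float:
    """Toy proxy: longer, multi-clause questions score as more complex."""
    tokens = query.split()
    return min(1.0, len(tokens) / 40 + 0.3 * query.count(","))

print(route_query("What is the normal adult heart rate?", toy_complexity))
```

In practice the router would be a trained classifier over query features; the point of the sketch is only the dispatch structure, where the expensive reasoning mode is invoked solely when the predicted complexity warrants it.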
Problem

Research questions and friction points this paper is trying to address.

Balancing performance and cost in LLM model selection
Dynamically routing queries by complexity to optimize efficiency
Reducing over-reasoning on simple queries to improve accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-state LLM for cost-performance balance
ML-based dynamic query routing framework
AIT index for accuracy-latency-cost trade-offs
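The AIT (Accuracy-Inference-Token) index jointly scores accuracy, latency, and token cost. Its exact formula is not given on this page, so the weighted composite below, including the weights and the normalization against a thinking-only reference, is an illustrative assumption only.

```python
# Hypothetical composite score in the spirit of the AIT index: reward
# accuracy, penalize latency and token use relative to a reference run
# (e.g., the always-thinking baseline). Weights are illustrative.

def ait_score(accuracy: float, latency_s: float, tokens: float,
              ref_latency_s: float, ref_tokens: float,
              w_acc: float = 0.5, w_lat: float = 0.25,
              w_tok: float = 0.25) -> float:
    """Higher is better."""
    lat_saving = 1.0 - latency_s / ref_latency_s   # fraction of latency saved
    tok_saving = 1.0 - tokens / ref_tokens         # fraction of tokens saved
    return w_acc * accuracy + w_lat * lat_saving + w_tok * tok_saving

# Plugging in the abstract's reported numbers (0.8390 accuracy, 36.8% less
# latency, 39.66% fewer tokens vs. the 0.8272-accuracy thinking baseline),
# with arbitrary reference units of 10 s and 1000 tokens:
baseline = ait_score(0.8272, 10.0, 1000.0, ref_latency_s=10.0, ref_tokens=1000.0)
routed = ait_score(0.8390, 6.32, 603.4, ref_latency_s=10.0, ref_tokens=1000.0)
assert routed > baseline
```

Under any such monotone composite, the routed system dominates the baseline here because it improves all three axes at once.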
👥 Authors

Wencheng Zhang
Bytedance
Shiqin Qiao
Bytedance
Lingjie Luo
Bytedance
Yinfeng Li
Xidian University
Chuanyang Zheng
The Chinese University of Hong Kong
Qian Xu
Bytedance
Meng Li
Bytedance
Yong Gui
Bytedance
Yijun He
Bytedance
Jianing Qiu
Assistant Professor, Mohamed bin Zayed University of Artificial Intelligence
Medical Foundation Model · Agentic Medical AI · Human-AI Interaction/Collaboration
Jindong Hong
Peking University
Jiankai Sun
Bytedance