What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

📅 2025-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the lack of a systematic understanding of test-time scaling (TTS) for large language models (LLMs). It proposes a unified four-dimensional framework (what to scale, how to scale, where to scale, and how well to scale) covering scaling objectives, mechanisms, application scenarios, and efficacy. Through a systematic literature review and conceptual abstraction, it integrates key techniques, including chain-of-thought prompting, self-verification, re-ranking, parallelized reasoning, and resource scheduling, into a structured TTS taxonomy. The framework elucidates the technical essence of TTS, clarifies its evolutionary trajectory, unifies evaluation criteria, and pinpoints current bottlenecks. It also highlights scalability, cross-task generalization, and attribution analysis as critical future research directions, providing both theoretical foundations and practical guidance for designing and deploying TTS methods.

📝 Abstract

As enthusiasm for scaling computation (data and parameters) in the pretraining era has gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and improving attribution analysis.

Problem

Research questions and friction points this paper is trying to address.

Surveying test-time scaling in LLMs for enhanced problem-solving

Proposing a framework for understanding TTS research dimensions

Identifying challenges and future directions in TTS development

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling enhances LLM problem-solving

Unified framework for TTS research dimensions

Guidelines for practical TTS deployment

🔎 Similar Papers

Large Vocabulary Size Improves Large Language Models