🤖 AI Summary
This survey addresses the lack of a systematic review of test-time scaling (TTS) for multimodal foundation models (MFMs). In what the authors present as the first comprehensive review of the area, it proposes a unified taxonomic framework that groups existing methods into three strategies: sampling-based, feedback-based, and search-based. It then summarizes representative applications and the benchmarks commonly used to evaluate multimodal TTS on generation and reasoning tasks. Finally, it discusses open challenges and outlines future research directions, offering both a reference and a roadmap for subsequent work in this area.
📝 Abstract
Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent work has adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and a unified theoretical framework that delineate how multimodal TTS has developed. To bridge this gap, we present the first comprehensive review of TTS research for MFMs and propose a unified taxonomic framework that categorizes existing methods into three strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and the benchmarks commonly used to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.
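
To make the idea of spending extra inference compute concrete, the following is a minimal sketch of best-of-N selection, which is commonly described as a sampling-based strategy. It is an illustration under stated assumptions, not a method taken from the survey: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer merely stand in for a multimodal model and a verifier or reward model.

```python
import random
from typing import Callable


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n candidates and return the highest-scoring one.

    A larger n means more inference-time compute is spent on the same query.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-in for a model: emits a random integer answer as text.
def toy_generate(rng: random.Random) -> str:
    return str(rng.randint(0, 100))


# Hypothetical stand-in for a verifier: prefers answers close to 42.
def toy_score(answer: str) -> float:
    return -abs(int(answer) - 42)


if __name__ == "__main__":
    for n in (1, 4, 16):
        best = best_of_n(toy_generate, toy_score, n)
        print(f"n={n:2d}  best={best}  score={toy_score(best)}")
```

Because the seed is fixed, the candidates drawn for a smaller `n` are a prefix of those drawn for a larger `n`, so the selected score never decreases as `n` grows in this toy setting. Feedback-based and search-based strategies would replace the single select step with iterative refinement or structured exploration, which this sketch does not attempt to show.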