Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This survey addresses the lack of a systematic treatment of test-time scaling (TTS) for multimodal foundation models (MFMs). It presents what the authors describe as the first comprehensive review of the area and proposes a unified taxonomy that groups existing methods into three strategies: sampling-based, feedback-based, and search-based. It then summarizes representative applications in generation and reasoning tasks, together with the benchmarks commonly used to evaluate them, and closes with open challenges and future research directions.

📝 Abstract

Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.
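The abstract names the three strategies without defining them, so the following is only a rough illustration based on how these terms are commonly used, not the survey's own definitions. It is a minimal Python sketch on a toy numeric task: best-of-N selection stands in for sampling-based methods, accept-if-better revision for feedback-based methods, and beam search for search-based methods. Every name here (`best_of_n`, `refine_with_feedback`, `beam_search`, and the `toy_*` helpers) is hypothetical, and the integer "candidates" stand in for model outputs.

```python
import random
from typing import Callable, List

TARGET = 42  # toy stand-in for "the ideal output"; purely illustrative


def toy_score(x: int) -> int:
    """Toy verifier/reward: higher is better, 0 is a perfect match."""
    return -abs(x - TARGET)


def toy_generate(rng: random.Random) -> int:
    """Toy stand-in for sampling one candidate output from a model."""
    return rng.randint(0, 100)


def toy_revise(x: int) -> int:
    """Toy stand-in for revising a candidate in response to feedback."""
    return x + 7


def toy_expand(x: int) -> List[int]:
    """Toy stand-in for proposing next steps from a partial solution."""
    return [x + 1, x + 5, x + 10]


def best_of_n(generate: Callable[[random.Random], int],
              score: Callable[[int], int],
              n: int, rng: random.Random) -> int:
    """Sampling-based (assumed form): draw n candidates, keep the best."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


def refine_with_feedback(initial: int,
                         score: Callable[[int], int],
                         revise: Callable[[int], int],
                         rounds: int) -> int:
    """Feedback-based (assumed form): revise one candidate repeatedly,
    accepting a revision only when the score improves."""
    current = initial
    for _ in range(rounds):
        proposal = revise(current)
        if score(proposal) > score(current):
            current = proposal
    return current


def beam_search(start: int,
                expand: Callable[[int], List[int]],
                score: Callable[[int], int],
                beam_width: int, depth: int) -> int:
    """Search-based (assumed form): expand step by step, keep the top
    beam_width states, and return the best state seen at any depth."""
    beam = [start]
    best = start
    for _ in range(depth):
        expanded = [child for node in beam for child in expand(node)]
        if not expanded:
            break
        beam = sorted(expanded, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best
```

In all three functions, the extra inference-time compute is an explicit knob (`n`, `rounds`, or `beam_width` and `depth`), which is the sense in which such methods "scale" at test time. In a real multimodal setting the generator and scorer would be model calls, which this sketch deliberately leaves out.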

Problem

Research questions and friction points this paper is trying to address.

Test-Time Scaling

Multimodal Foundation Models

Systematic Survey

Unified Framework

Multimodal Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling

Multimodal Foundation Models

Taxonomic Framework