🤖 AI Summary
Existing multimodal large language models (MLLMs) for image regression rely on predefined vocabularies and generic prompt tuning, yet empirical analysis shows they fail to leverage textual semantics and perform no better than image-only baselines. We identify two root causes: quantization error induced by discrete tokenization, and semantically impoverished prompts.
Method: We propose RvTC—a Transformer-based unified classification-regression framework that eliminates fixed vocabularies. It introduces data-adaptive binning and semantically rich, task-specific prompts, enabling end-to-end cross-modal joint optimization.
Contribution/Results: RvTC removes discretization error without architectural complexity. Evaluated on four image quality assessment benchmarks—including AVA—it achieves significant gains (e.g., Pearson correlation on AVA rises from 0.83 to 0.90). This is the first work to empirically validate the critical performance gain of semantic prompting in multimodal regression.
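The contrast between generic and data-specific prompts can be sketched as follows. The function names and prompt template here are illustrative assumptions, not the paper's exact implementation; the idea is simply that the enriched prompt carries per-image semantic context (e.g., the AVA photography-challenge title) rather than a fixed task description.

```python
def generic_prompt() -> str:
    """Task-level prompt, identical for every image (the baseline style)."""
    return "How would you rate this image?"

def data_specific_prompt(challenge_title: str) -> str:
    """Hypothetical data-specific prompt: prepends per-image semantic
    context, such as the challenge the photo was submitted to on AVA."""
    return (
        f"This photo was submitted to the challenge '{challenge_title}'. "
        "How would you rate this image?"
    )

print(data_specific_prompt("Silhouettes"))
```

Under this scheme, each training and evaluation example pairs the image with text that actually describes it, which is what lets the model exploit cross-modal semantics.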
📝 Abstract
Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., "How would you rate this image?"), assuming this mimics human rating behavior. Our analysis reveals that these approaches provide no benefit over image-only training: models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through a straightforward increase in bin count, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts improves the Pearson correlation from 0.83 to 0.90, a new state-of-the-art. Empirical evidence from the AVA and AGIQA-3k datasets shows that MLLMs benefit from genuine semantic information in prompts rather than mere statistical biases, underscoring the importance of meaningful textual context in multimodal regression tasks.
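A minimal sketch of the bin-based regression-via-classification idea: discretize the score range into many bins, let the model produce one logit per bin, and read out a continuous score as the softmax-weighted average of bin centers. The bin layout, score range, and function name are assumptions for illustration; the point is that increasing the bin count shrinks the discretization error without any change to the classification architecture.

```python
import math

def expected_score(logits: list[float], lo: float = 1.0, hi: float = 10.0) -> float:
    """Map per-bin classification logits to a continuous score.

    Bins partition [lo, hi] uniformly (an illustrative assumption); the
    prediction is the probability-weighted mean of the bin centers, so
    finer binning directly reduces quantization error.
    """
    n = len(logits)
    width = (hi - lo) / n
    centers = [lo + (i + 0.5) * width for i in range(n)]
    # Numerically stable softmax over the bin logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * c for p, c in zip(probs, centers))

# Uniform logits over 100 bins -> midpoint of the [1, 10] range.
print(round(expected_score([0.0] * 100), 2))  # 5.5
```

Training such a head reduces to ordinary classification over bin indices, which is why no distributional modeling machinery is needed to recover continuous targets.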