🤖 AI Summary
This paper challenges the prevailing paradigm for pixel-level multimodal large language models (MLLMs), which relies heavily on large-scale pixel-level grounding supervision. It shows that models trained this way perform weakly on recent vision-centric visual question answering (VQA) benchmarks, and that such training can even degrade the grounding ability MLLMs exhibit without it. Method: Two novel, challenging benchmarks are introduced to systematically investigate *when* grounding emerges in MLLMs trained *without pixel-level grounding supervision*, revealing that it can coincide with object parts or with location/appearance information. The paper further proposes PixFoundation, a simple plug-and-play baseline for extracting grounding information from any MLLM. Contribution/Results: MLLMs without pixel-level grounding supervision, equipped with these baselines, outperform state-of-the-art pixel-level MLLMs on both pixel-level grounding and VQA on the proposed benchmarks. Code and benchmarks are publicly released to encourage critical reflection on the development trajectory of pixel-level MLLMs.
📝 Abstract
Multiple works have emerged to push the boundaries of multimodal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art on such tasks when evaluated on both pixel-level grounding and visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts or location/appearance information. Code repository is at https://github.com/MSiam/PixFoundation/.