🤖 AI Summary
Existing image-text datasets often lack fine-grained scene descriptions and complete scene coverage. To address this, we introduce COCONut-PanCap, a dataset that jointly covers panoptic segmentation and region-level grounded captioning. It provides pixel-accurate panoptic masks aligned with human-edited, region-specific grounded captions, enabling unified evaluation of both visual understanding (panoptic segmentation) and grounded language generation (region-grounded captioning). Methodologically, the dataset builds on COCO images with COCONut panoptic masks, human-in-the-loop dense annotation, grounded caption modeling, and a multimodal joint-training setup. Experiments show that COCONut-PanCap improves vision-language models on visual understanding and text-to-image generation tasks, outperforming baselines across multiple metrics. These results underline the importance of fine-grained, scene-comprehensive, and strictly aligned grounded annotations for advancing multimodal learning.
📝 Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
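To make the mask-caption alignment concrete, the sketch below shows one possible way a single annotation record could be organized in code. This is purely illustrative: the field names (segment_id, char_span, panoptic_mask_file, etc.) are assumptions for this sketch, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedRegionCaption:
    """One region-level caption tied to a panoptic segment (hypothetical layout)."""
    segment_id: int             # id of the panoptic mask segment this caption describes
    category: str               # panoptic category name, e.g. "person" or "sky"
    caption: str                # human-edited, fine-grained description of the region
    char_span: Tuple[int, int]  # where this phrase appears inside the full dense caption


@dataclass
class PanCapExample:
    """One image with a dense caption grounded in its panoptic segmentation."""
    image_id: int
    panoptic_mask_file: str     # mask image whose pixel values encode segment ids
    dense_caption: str          # full scene-level description
    regions: List[GroundedRegionCaption]


def grounded_phrases(example: PanCapExample) -> List[Tuple[int, str]]:
    """Return (segment_id, phrase) pairs linking each mask region to its caption span."""
    return [
        (r.segment_id, example.dense_caption[r.char_span[0]:r.char_span[1]])
        for r in example.regions
    ]
```

Under this assumed layout, a training pipeline could iterate over grounded_phrases() to supervise both segmentation-conditioned captioning and caption-conditioned generation from the same record.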