🤖 AI Summary
Existing image-text datasets often lack fine-grained scene descriptions and complete scene coverage. To address this, we introduce COCONut-PanCap, a dataset that jointly covers panoptic segmentation and region-level grounded captioning. It provides pixel-accurate panoptic masks aligned with human-edited, region-specific grounded captions, enabling unified evaluation of both visual understanding (panoptic segmentation) and grounded language generation (region-grounded captioning). Methodologically, the dataset builds on COCO images with COCONut panoptic masks, human-in-the-loop dense annotation, grounded caption modeling, and a multimodal joint-training setup. Experiments show that COCONut-PanCap improves vision-language models on visual understanding and text-to-image generation tasks, outperforming baselines across multiple metrics. These results underline the importance of fine-grained, scene-comprehensive, and strictly aligned grounded annotations for advancing multimodal learning.
📝 Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
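To make the mask-caption alignment concrete, the sketch below shows one possible way a single annotation record could be organized in code. This is purely illustrative: the field names (segment_id, char_span, panoptic_mask_file, etc.) are assumptions for this sketch, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedRegionCaption:
    """One region-level caption tied to a panoptic segment (hypothetical layout)."""
    segment_id: int             # id of the panoptic mask segment this caption describes
    category: str               # panoptic category name, e.g. "person" or "sky"
    caption: str                # human-edited, fine-grained description of the region
    char_span: Tuple[int, int]  # where this phrase appears inside the full dense caption


@dataclass
class PanCapExample:
    """One image with a dense caption grounded in its panoptic segmentation."""
    image_id: int
    panoptic_mask_file: str     # mask image whose pixel values encode segment ids
    dense_caption: str          # full scene-level description
    regions: List[GroundedRegionCaption]


def grounded_phrases(example: PanCapExample) -> List[Tuple[int, str]]:
    """Return (segment_id, phrase) pairs linking each mask region to its caption span."""
    return [
        (r.segment_id, example.dense_caption[r.char_span[0]:r.char_span[1]])
        for r in example.regions
    ]
```

Under this assumed layout, a training pipeline could iterate over grounded_phrases() to supervise both segmentation-conditioned captioning and caption-conditioned generation from the same record.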