FlexCap: Describe Anything in Images in Controllable Detail

📅 2024-03-18

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing vision-language models offer little control over granularity, which makes it difficult to generate image descriptions at a user-specified level of detail. To address this, we propose FlexCap, a vision-language model that supports length-controllable, multi-granularity region captioning. Our key contributions are: (1) a length-conditioned region captioning paradigm; (2) a large-scale, multi-length, weakly supervised region caption dataset, coupled with a region-localization-guided knowledge distillation strategy for efficient training; and (3) joint modeling of visual features and target caption length. Experiments show that FlexCap achieves strong performance on the Visual Genome dense captioning task and state-of-the-art zero-shot results on multiple VQA benchmarks, including GQA and VQAv2. FlexCap also supports downstream applications such as image annotation, fine-grained attribute recognition, and vision-language dialogue.
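
As a rough illustration of what length conditioning could look like at the input level, the sketch below pairs a box with a control token that encodes the desired caption length. The token format (`<len_K>`), the `<bos>` marker, the 32-word cap, and every function and class name here are assumptions made for illustration; the summary does not specify how FlexCap actually encodes the target length.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Box as (x_min, y_min, x_max, y_max), normalized to [0, 1]. Assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class RegionQuery:
    box: Box
    target_length: int  # desired caption length in words


def length_token(target_length: int, max_length: int = 32) -> str:
    """Map a desired word count to a hypothetical control token."""
    if target_length < 1:
        raise ValueError("target_length must be at least 1")
    # Clip overly long requests to the largest supported length.
    return f"<len_{min(target_length, max_length)}>"


def build_model_inputs(query: RegionQuery) -> Dict[str, List]:
    """Bundle the box and the length-conditioned text prefix for a captioner."""
    x0, y0, x1, y1 = query.box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must be normalized with x0 < x1 and y0 < y1")
    return {
        "box": [x0, y0, x1, y1],
        "text_prefix": [length_token(query.target_length), "<bos>"],
    }
```

Under this assumed scheme, requesting a one-word caption for a box would yield a short object label, while a larger `target_length` for the same box would ask the decoder for a more detailed description.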

📝 Abstract

We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
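
The idea of feeding localized descriptions to a large language model for VQA can be sketched as plain prompt assembly. The following is a minimal sketch only: the prompt wording, the pixel box format, and the function names are hypothetical, and the abstract does not describe the actual prompt or LLM interface used by the authors.

```python
from typing import List, Tuple

# Box in pixels as (x_min, y_min, x_max, y_max). Assumed convention.
Box = Tuple[int, int, int, int]


def format_region(box: Box, caption: str) -> str:
    """Render one region caption as a single prompt line."""
    x0, y0, x1, y1 = box
    return f"- [{x0}, {y0}, {x1}, {y1}]: {caption.strip()}"


def build_vqa_prompt(regions: List[Tuple[Box, str]], question: str) -> str:
    """Combine region captions and a question into a text-only LLM prompt."""
    lines = ["Region descriptions (box in pixels: x_min, y_min, x_max, y_max):"]
    lines += [format_region(box, caption) for box, caption in regions]
    lines.append(f"Question: {question.strip()}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Captions here are made-up placeholders, not model outputs.
    demo_regions = [
        ((12, 40, 180, 220), "a red ceramic mug on a wooden table"),
        ((200, 35, 420, 300), "an open laptop"),
    ]
    print(build_vqa_prompt(demo_regions, "What color is the mug?"))
```

The resulting string would be passed to whichever language model is available; because the model only sees text, the quality of the answer depends on how much detail the region captions carry.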

Problem

Research questions and friction points this paper is trying to address.

Image Captioning

Variable Detail Level

Flexibility in Description

Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexCap

Dense Captioning

Zero-shot Learning
