FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

📅 2026-06-01
🤖 AI Summary
Existing vision-language datasets contain few fine-grained hard negative samples, which limits how well models can discriminate between similar attributes. To address this, the work proposes FineGen, a multi-agent closed-loop framework built on vision-language models (VLMs) that automatically synthesizes attribute-level hard negatives through an iterative generation-verification-correction pipeline, so that each negative is semantically valid yet strictly contradicts the image content. The construction enforces a 1:10 positive-to-negative ratio. Applying the framework to ImageNet, the authors build FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a 96.7% attribute validity rate. On the FG-OVD benchmark, fine-tuning on FineGen-100K yields a +14.4% accuracy improvement on hard samples.
📝 Abstract
The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.
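The closed-loop Generation-Verification-Correction idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify agent interfaces, prompts, or stopping rules, so the names `Verdict`, `build_hard_negative`, and `max_rounds`, and the three `toy_*` agents that stand in for VLM calls, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical verifier output; the paper's actual format is not given."""
    valid: bool         # caption is semantically plausible on its own
    contradicts: bool   # caption conflicts with the image content
    feedback: str = ""  # hint passed to the correction agent


# Assumed agent signatures (in practice each would wrap a VLM call).
Generator = Callable[[str, str], str]     # (positive caption, attribute) -> candidate
Verifier = Callable[[str, str], Verdict]  # (image description, candidate) -> verdict
Corrector = Callable[[str, str], str]     # (candidate, feedback) -> revised candidate


def build_hard_negative(
    image_desc: str,
    positive: str,
    attribute: str,
    generate: Generator,
    verify: Verifier,
    correct: Corrector,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a candidate, then verify and correct it until it passes.

    Returns the accepted hard negative, or None if no candidate passes
    after max_rounds corrections.
    """
    candidate = generate(positive, attribute)
    for attempt in range(max_rounds + 1):
        verdict = verify(image_desc, candidate)
        if verdict.valid and verdict.contradicts:
            return candidate
        if attempt < max_rounds:
            # Closed loop: verifier feedback drives the next revision.
            candidate = correct(candidate, verdict.feedback)
    return None


# Toy stand-ins so the sketch runs without any model.
def toy_generate(positive: str, attribute: str) -> str:
    return positive  # deliberately bad: the caption is left unchanged


def toy_verify(image_desc: str, candidate: str) -> Verdict:
    contradicts = candidate != image_desc
    feedback = "" if contradicts else "caption still matches the image"
    return Verdict(valid=True, contradicts=contradicts, feedback=feedback)


def toy_correct(candidate: str, feedback: str) -> str:
    return candidate.replace("red", "blue")


negative = build_hard_negative(
    "a red wooden chair", "a red wooden chair", "color",
    toy_generate, toy_verify, toy_correct,
)
```

In this toy run the first candidate fails verification because it still matches the image, and the correction step produces the accepted negative. Under the same assumptions, a 1:10 positive-to-negative ratio could be approximated by repeating the loop over different attributes until ten negatives are accepted per positive caption.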
Problem

Research questions and friction points this paper is trying to address.

fine-grained perception
hard negative samples
vision-language datasets
image-text dataset construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained perception
hard negative generation
vision-language model
multi-agent framework
dataset construction