FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing open-vocabulary object detection methods treat decoder query generation as a discriminative prediction task, which limits the diversity and semantic flexibility of the queries. This work instead takes a generative view: it introduces continuous latent query dynamics that model decoder queries as a text-conditioned rectified flow, progressively transforming text-agnostic queries into semantically aligned, text-guided queries in latent space and avoiding heuristic discrete query construction. The framework is trained end-to-end with a vision-language model and achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP and +4.1 AP, respectively, with the larger gain on the long-tailed LVIS benchmark.
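To make the idea of a text-conditioned rectified flow over queries concrete, the following is a minimal NumPy sketch of the generic rectified-flow recipe: a straight-line path between source and target, a velocity-regression loss, and Euler integration from t = 0 to t = 1. It is not the FlowOVD implementation. All function names, the toy shapes, and the `oracle_velocity` field (a stand-in for a learned network) are assumptions made for illustration.

```python
import numpy as np

def interpolate(q0, q1, t):
    """Point at time t in [0, 1] on the straight line from q0 to q1."""
    return (1.0 - t) * q0 + t * q1

def flow_matching_loss(velocity_fn, q0, q1, text, t):
    """Rectified-flow objective: regress the predicted velocity onto q1 - q0."""
    q_t = interpolate(q0, q1, t)
    pred = velocity_fn(q_t, t, text)
    return float(np.mean((pred - (q1 - q0)) ** 2))

def euler_transport(velocity_fn, q0, text, num_steps=4):
    """Integrate dq/dt = v(q, t, text) from t = 0 to t = 1 with Euler steps."""
    q = np.array(q0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        q = q + dt * velocity_fn(q, k * dt, text)
    return q

def oracle_velocity(q, t, text):
    """Hypothetical stand-in for a learned network: pulls every query
    straight toward the text embedding (only valid for t < 1)."""
    return (text - q) / (1.0 - t)

# Toy data (assumed): 3 text-agnostic queries of width 4, one text embedding.
rng = np.random.default_rng(0)
q0 = rng.normal(size=(3, 4))
text = np.ones(4)
q1 = np.broadcast_to(text, q0.shape)  # toy text-guided targets

loss = flow_matching_loss(oracle_velocity, q0, q1, text, t=0.3)
q_final = euler_transport(oracle_velocity, q0, text, num_steps=4)
```

In this toy setup the oracle field matches the straight-line velocity exactly, so the loss is numerically zero and the Euler rollout ends at the text embedding. A trained velocity network would only approximate this behaviour, and how the paper conditions on text or chooses the number of steps is not specified here.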

📝 Abstract

Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem in which decoder queries are either static or initialized from encoder features, which limits their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a vision-language model (VLM)-based detector, our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5%) and +4.1 AP (+15.0%), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
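The relative gains quoted above can be checked from the absolute numbers alone. The short sketch below assumes that each percentage is measured relative to the GroundingDINO score, and that this baseline score equals the FlowOVD score minus the stated absolute gain. The baseline values are therefore implied by subtraction, not quoted from the text, and the helper names are hypothetical.

```python
def implied_baseline(score, abs_gain):
    """Baseline AP implied by a reported score and its absolute gain."""
    return score - abs_gain

def relative_gain_pct(score, abs_gain):
    """Gain as a percentage of the implied baseline AP."""
    return 100.0 * abs_gain / implied_baseline(score, abs_gain)

# Reported numbers: (FlowOVD AP, absolute gain over GroundingDINO).
coco_pct = relative_gain_pct(49.5, 1.2)  # implied baseline: 48.3 AP
lvis_pct = relative_gain_pct(31.5, 4.1)  # implied baseline: 27.4 AP

print(f"COCO: +{coco_pct:.1f}%  LVIS: +{lvis_pct:.1f}%")
```

Under these assumptions the two values round to the 2.5% and 15.0% stated in the abstract, so the absolute and relative figures are mutually consistent.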

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary detection

decoder query generation

vision-language model

semantic alignment

zero-shot

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative latent flows

open-vocabulary detection

rectified flow