DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-Vocabulary Multi-Label Recognition (OV-MLR) demands both fine-grained object localization and explicit modeling of inter-class semantic relationships. However, existing vision-language pre-training (VLP) models suffer from insufficient localization accuracy under weak supervision and lack mechanisms for structured inter-class relationship modeling. To address these limitations, the paper proposes a dual adaptive refinement framework: (1) a structured class-relation graph, automatically mined by a large language model, is combined with a graph attention network to enable adaptive inter-class knowledge transfer; and (2) a weakly supervised patch selection loss sharpens intra-class localization and feature discriminability without fine-tuning the frozen VLP backbone. The method achieves state-of-the-art performance across multiple benchmarks, with notable accuracy gains on unseen classes, thereby addressing two key bottlenecks of VLP-based OV-MLR: localization fidelity under weak supervision and structured inter-class relationship modeling.

📝 Abstract
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that DART achieves new state-of-the-art performance, validating its effectiveness.
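The inter-class transfer described above can be illustrated with a minimal sketch: a single graph-attention pass that mixes each class representation with its neighbors in a relation graph. This is not the paper's implementation; the function names, the toy graph, and the single shared attention vector are all illustrative assumptions, and a real ATM would operate on learned VLP class embeddings with a multi-head GAT.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gat_transfer(class_feats, adj, w_att):
    """One graph-attention pass over class embeddings (illustrative).

    class_feats: dict mapping class name -> feature vector (list of floats)
    adj: dict mapping class name -> list of neighbor classes
         (e.g. edges mined from an LLM-built class relationship graph)
    w_att: attention weight vector scoring concatenated (src, nbr) features
    Returns refined class features as attention-weighted neighbor averages,
    so related classes share evidence while unrelated ones stay untouched.
    """
    refined = {}
    for c, feat in class_feats.items():
        nbrs = [c] + adj.get(c, [])  # self-loop keeps the class's own signal
        # Unnormalized attention logit e_ij = w_att . [h_i ; h_j]
        logits = [
            sum(w * x for w, x in zip(w_att, feat + class_feats[n]))
            for n in nbrs
        ]
        alphas = softmax(logits)
        dim = len(feat)
        refined[c] = [
            sum(a * class_feats[n][d] for a, n in zip(alphas, nbrs))
            for d in range(dim)
        ]
    return refined
```

With related classes "cat" and "dog" linked in the graph, their refined features move toward each other (useful when one is unseen), while an isolated class such as "car" is returned unchanged.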
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained object localization when only weak, image-level supervision is available.
Explicitly modeling inter-class dependencies, which VLP models capture only through basic semantics.
Transferring structured relational knowledge so recognition of unseen classes improves in OV-MLR.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Refinement Module for intra-class localization
Class Relationship Graph from LLM for inter-class transfer
Weakly Supervised Patch Selecting loss for discriminative localization
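The last bullet, a patch-selecting loss trained from image-level labels alone, can be sketched as top-k pooling over per-patch class scores followed by an image-level BCE. This is a simplified stand-in under assumed names (`wps_style_loss`, toy logits), not the paper's exact WPS loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def topk_mean(scores, k):
    # Average the k highest patch scores for one class.
    return sum(sorted(scores, reverse=True)[:k]) / k

def wps_style_loss(patch_scores, labels, k=2):
    """Image-level BCE over top-k pooled patch-class scores (illustrative).

    patch_scores: dict mapping class name -> list of per-patch similarity logits
    labels: dict mapping class name -> 0/1 image-level label
    Pooling only the top-k patches per class concentrates the gradient on the
    most class-relevant regions, which is how discriminative localization can
    emerge from image-level supervision alone.
    """
    loss = 0.0
    for c, scores in patch_scores.items():
        p = sigmoid(topk_mean(scores, k))
        y = labels[c]
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(patch_scores)
```

For a present class, the loss pushes up the scores of its best-matching patches; for an absent class, it pushes down even the highest-scoring patches, so the selection itself becomes discriminative.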