DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

📅 2024-05-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current multimodal models overlook task-specific resolution requirements in image-region-to-language alignment, leading to imprecise referring expressions. To address this, we propose a dynamic resolution modeling framework built around a stochastic nested multi-scale view mechanism inspired by human visual cognition. During training, it constructs multi-resolution images and applies stochastic nested sampling to achieve resolution-robust alignment. At inference, it adaptively selects the optimal view based on task semantics and image priors, introducing cross-scale language–vision alignment. Evaluated on three core tasks (region captioning, open-vocabulary region recognition, and attribute detection), a single model achieves state-of-the-art performance on all of them, substantially improving multi-task synergy. It is the first work to realize resolution-aware fine-grained cross-modal alignment.
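
The training-time view construction can be sketched in a few lines. This is a minimal illustration assuming a PIL image and a pixel-coordinate box; the function name, the number of views, and the 0.1–0.5 context-expansion ratios are hypothetical choices, not taken from the authors' code:

```python
# Hypothetical sketch of stochastic nested view sampling (training time).
# Function name, view count, and context ratios are illustrative assumptions,
# not the authors' actual implementation.
import random
from PIL import Image

def sample_nested_views(image: Image.Image, region, n_views=3, out_size=224):
    """Build a set of nested views around a referred region.

    region: (x0, y0, x1, y1) box of the referred region in pixels.
    View 0 is the tightest crop; each later view stochastically expands the
    previous box, so the views are nested and cover progressively more context.
    """
    x0, y0, x1, y1 = map(float, region)
    views = []
    for _ in range(n_views):
        crop = image.crop((int(x0), int(y0), int(x1), int(y1)))
        views.append(crop.resize((out_size, out_size)))
        # Randomly grow the box to include more surrounding context.
        pad_w = random.uniform(0.1, 0.5) * (x1 - x0)
        pad_h = random.uniform(0.1, 0.5) * (y1 - y0)
        x0 = max(0.0, x0 - pad_w)
        y0 = max(0.0, y0 - pad_h)
        x1 = min(float(image.width), x1 + pad_w)
        y1 = min(float(image.height), y1 + pad_h)
    return views
```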

📝 Abstract
One fundamental task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from producing precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns the language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. During inference, DynRefer performs selective multimodal referring by sampling proper region representations for each task from the nested views, based on image and task priors. This allows the visual information used for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement across a broad range of tasks, including region-level captioning, open-vocabulary region recognition, and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is available at https://github.com/callsys/DynRefer.
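
For the inference side, here is a minimal sketch of selective referring over the nested views returned above. The task-to-view mapping is an illustrative assumption, not DynRefer's actual selection rule, which derives the choice from image and task priors:

```python
# Hypothetical sketch of inference-time selective referring.
# The task-to-view mapping below is an illustrative assumption; the paper
# selects views from task semantics and image priors.
def select_views(views, task: str):
    """Pick region representations from tight-to-wide nested views per task."""
    if task == "recognition":   # category labels: favor the tightest crop
        return views[:1]
    if task == "attribute":     # attributes: tight crop plus a little context
        return views[:2]
    return views                # captioning: keep all scales for context
```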
Problem

Research questions and friction points this paper is trying to address.

Improving the accuracy of region-level multimodal tasks
Adapting resolution to produce precise language descriptions
Matching the visual information used for referring to human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic resolution adaptation for multimodal tasks
Stochastic alignment of language descriptions with multi-resolution images
Selective multimodal referring based on image and task priors
👥 Authors
Yuzhong Zhao
University of Chinese Academy of Sciences
Feng Liu
University of Chinese Academy of Sciences
Yue Liu
University of Chinese Academy of Sciences
Mingxiang Liao
University of Chinese Academy of Sciences
Chen Gong
University of Virginia
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection, Image Processing
Fang Wan
University of Chinese Academy of Sciences