Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal models for camouflaged object recognition deviate significantly from human visual mechanisms, failing to leverage foreground-background similarity for concealed target localization. To address this, the authors propose a "visual refocusing" reinforcement framework that emulates human-like progressive attention reallocation and dynamic invocation of prior knowledge. The method establishes a hierarchical attention-transfer mechanism and a reasoning-token-driven dynamic bounding-box optimization scheme, integrating policy-optimized reinforcement fine-tuning, multi-step visual reasoning modeling, and similarity-guided attention modulation. Experiments demonstrate substantial improvements over supervised fine-tuning (SFT) baselines on both camouflaged classification and detection tasks, and for the first time achieve comprehensive superiority over human performance. Visualizations confirm continuous, reasoning-token-triggered bounding-box evolution and human-like visual refocusing behavior.
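The summary's "reasoning-token-driven dynamic bounding-box optimization" suggests a reward that scores how the predicted box evolves across reasoning steps. The paper does not publish its reward function, so the following is a minimal hypothetical sketch: an IoU-based reward that credits the final localization plus a small bonus for each reasoning step that tightens the box toward the ground truth. The box format `(x1, y1, x2, y2)`, the bonus weight `0.1`, and the function names are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: IoU-based reward favoring progressive box refinement
# across reasoning steps. Box format (assumed): (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def refocus_reward(box_steps, gt_box, step_bonus=0.1):
    """Reward the final localization, plus a bonus (assumed weight)
    whenever a reasoning step improves IoU over the previous step."""
    ious = [iou(b, gt_box) for b in box_steps]
    bonus = sum(step_bonus for prev, cur in zip(ious, ious[1:]) if cur > prev)
    return ious[-1] + bonus
```

Under this sketch, a trajectory whose boxes shrink toward the concealed object earns both a higher final IoU and per-step bonuses, so the policy is encouraged to "refocus" rather than commit to a single guess.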

📝 Abstract
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes, which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human camouflaged perception to progressively and iteratively 'refocus' on visually concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via a policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities that align with and even surpass human camouflaged perception. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. Besides, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
Problem

Research questions and friction points this paper is trying to address.

Misalignment between multi-modal models and human visual perception in identifying camouflaged objects
Inability of models to emulate human cognitive processes for concealed object detection
Need for hierarchical attention shifting and dynamic knowledge refinement in object localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive visual refocus reinforcement framework
Hierarchical attention shifting with dynamic adjustment
Policy optimization for multi-modal reasoning
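The "policy optimization" contribution above is not specified in this listing; reinforcement fine-tuning pipelines of this kind commonly use a group-relative advantage, where rewards for several sampled responses to the same prompt are normalized against the group statistics. The sketch below illustrates that generic idea only; the function name and the `eps` stabilizer are assumptions, and this is not claimed to be the paper's actual algorithm.

```python
# Hypothetical sketch of a group-relative advantage (GRPO-style):
# rewards for a group of sampled responses to one prompt are
# normalized by the group mean and standard deviation.

import statistics

def group_advantages(rewards, eps=1e-6):
    """Return (r - mean) / (std + eps) for each sampled response;
    eps (assumed) guards against a zero-variance group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above the group mean receive positive advantages and are reinforced; below-mean responses are suppressed, which removes the need for a separate learned value baseline.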
Authors

Ruolin Shen (Technische Universität München)
Xiaozhong Ji (Nanjing University)
WU Kai (ByteDance)
Jiangning Zhang (Zhejiang University)
Yijun He (ByteDance)
HaiHua Yang (ByteDance)
Xiaobin Hu (Tencent Youtu Lab; Technische Universität München)
Xiaoyu Sun (Australian National University)