CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying three heterogeneous tasks—salient object detection (SOD), co-salient object detection (CoSOD), and salient instance segmentation (SIS)—within a single model for saliency detection. We propose the first vision-language model (VLM)-based chain-of-thought (CoT) framework for unified modeling. Our key contributions are: (1) formalizing all three tasks as multi-step reasoning processes within VLMs to bridge semantic gaps across tasks; (2) introducing Confidence-Guided Policy Optimization (CGPO), which uses the difference between reward and model confidence as a per-sample advantage signal, mitigating reward sparsity and the high computational cost of reinforcement learning; and (3) adopting a two-stage CoT training paradigm—supervised fine-tuning followed by RL—and incorporating output-to-reasoning data construction to ensure logical consistency. Experiments demonstrate state-of-the-art performance under extreme low-data regimes: our method matches or surpasses task-specific SOTAs and closed-source large models, achieving an S-measure of 0.899 on CoCA for CoSOD—8.0 percentage points higher than the prior best.

📝 Abstract
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, i.e., SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, notably achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
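The CGPO advantage described in the abstract can be sketched minimally. The page does not spell out how "model confidence" is computed, so treating confidence as the exponentiated mean token log-probability of the sampled response is an assumption of this sketch, not the paper's definition:

```python
import math

def cgpo_advantage(reward: float, token_logprobs: list[float]) -> float:
    """Per-sample advantage = reward - model confidence (sketch only).

    Confidence is assumed here to be the exponentiated mean token
    log-probability of the sampled response; the paper's exact
    definition may differ.
    """
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return reward - confidence

# A confident but wrong response (reward 0, high confidence) receives a
# strongly negative advantage, while an unconfident correct one receives
# a large positive advantage, so updates focus on informative samples.
wrong_but_confident = cgpo_advantage(0.0, [math.log(0.9)] * 8)  # ≈ -0.9
right_but_unsure = cgpo_advantage(1.0, [math.log(0.2)] * 8)     # ≈  0.8
```

Because the advantage is computed from a single sample, no group of rollouts is needed, which is how this scheme avoids GRPO's group-sampling overhead.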
Problem

Research questions and friction points this paper is trying to address.

Unifying heterogeneous saliency tasks through Chain-of-Thought reasoning
Addressing limitations in reinforcement learning with confidence-guided optimization
Constructing logically consistent training data for improved saliency detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified CoT framework for heterogeneous saliency tasks
Confidence-Guided Policy Optimization for RL training
Output-to-reasoning strategy for logical SFT data
Authors
Long Li — Research Staff Member, Inspur Group Co., Ltd. (Software Defined Networking, Network Performance Optimization)
Shuichen Ji — Northwestern Polytechnical University
Ziyang Luo — Salesforce AI Research (Agents, LLMs, Multimodal)
Nian Liu — Northwestern Polytechnical University
Dingwen Zhang — Northwestern Polytechnical University
Junwei Han — Northwestern Polytechnical University