CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying three heterogeneous tasks—salient object detection (SOD), co-salient object detection (CoSOD), and salient instance segmentation (SIS)—within a single model for saliency detection. We propose the first vision-language model (VLM)-based chain-of-thought (CoT) framework for unified modeling. Our key contributions are: (1) formalizing all three tasks as multi-step reasoning processes within VLMs to bridge semantic gaps across tasks; (2) introducing Confidence-Guided Policy Optimization (CGPO), which uses the difference between reward and model confidence as a per-sample advantage signal, mitigating reward sparsity and the high computational cost of reinforcement learning; and (3) adopting a two-stage CoT training paradigm—supervised fine-tuning followed by RL—and incorporating output-to-reasoning data construction to ensure logical consistency. Experiments demonstrate state-of-the-art performance under extreme low-data regimes: our method matches or surpasses task-specific SOTAs and closed-source large models, achieving an S-measure of 0.899 on CoCA for CoSOD—8.0 percentage points higher than the prior best.

📝 Abstract
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, i.e., SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, notably achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
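The CGPO advantage described in the abstract can be sketched minimally. The page does not spell out how "model confidence" is computed, so treating confidence as the exponentiated mean token log-probability of the sampled response is an assumption of this sketch, not the paper's definition:

```python
import math

def cgpo_advantage(reward: float, token_logprobs: list[float]) -> float:
    """Per-sample advantage = reward - model confidence (sketch only).

    Confidence is assumed here to be the exponentiated mean token
    log-probability of the sampled response; the paper's exact
    definition may differ.
    """
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return reward - confidence

# A confident but wrong response (reward 0, high confidence) receives a
# strongly negative advantage, while an unconfident correct one receives
# a large positive advantage, so updates focus on informative samples.
wrong_but_confident = cgpo_advantage(0.0, [math.log(0.9)] * 8)  # ≈ -0.9
right_but_unsure = cgpo_advantage(1.0, [math.log(0.2)] * 8)     # ≈  0.8
```

Because the advantage is computed from a single sample, no group of rollouts is needed, which is how this scheme avoids GRPO's group-sampling overhead.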
Problem

Research questions and friction points this paper is trying to address.

Unifying heterogeneous saliency tasks through Chain-of-Thought reasoning
Addressing limitations in reinforcement learning with confidence-guided optimization
Constructing logically consistent training data for improved saliency detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified CoT framework for heterogeneous saliency tasks
Confidence-Guided Policy Optimization for RL training
Output-to-reasoning strategy for logical SFT data
Authors
Long Li — Research Staff Member, Inspur Group Co., Ltd. (Software Defined Networking, Network Performance Optimization)
Shuichen Ji — Northwestern Polytechnical University
Ziyang Luo — Salesforce AI Research (Agents, LLMs, Multimodal)
Nian Liu — Northwestern Polytechnical University
Dingwen Zhang — Northwestern Polytechnical University
Junwei Han — Northwestern Polytechnical University