Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Successive Sub-value Q-learning (S2Q), a multi-agent reinforcement learning (MARL) approach that addresses a limitation of existing methods: by relying solely on a single optimal action, they often converge to suboptimal policies when the value function evolves during training. S2Q is the first MARL algorithm to explicitly model multiple suboptimal value functions, preserving high-value alternative actions and integrating a Softmax behavior policy to sustain effective exploration. Built on a value decomposition framework, S2Q tracks dynamically shifting optimal policies while improving exploration efficiency. Empirical evaluations demonstrate that S2Q consistently outperforms state-of-the-art algorithms across several challenging MARL benchmarks, exhibiting superior adaptability and overall performance.

📝 Abstract
Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
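The abstract's core mechanism can be illustrated with a small sketch: instead of acting greedily on a single Q-function, actions are sampled from a Softmax over value estimates that also retain high-value alternatives. This is a hypothetical illustration under assumed details (the aggregation by mean, the temperature, and all variable names are this sketch's choices, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_behavior_action(q_values, temperature=1.0):
    """Sample an action from a Softmax over Q-values.

    Unlike greedy argmax selection, this keeps probability mass on
    high-value alternative actions, sustaining exploration.
    """
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

# Hypothetical value estimates for 3 actions: a main Q-function
# plus two sub-value functions that track alternative actions.
q_main = np.array([1.0, 0.9, 0.2])
q_sub1 = np.array([0.8, 1.1, 0.3])
q_sub2 = np.array([0.7, 0.6, 1.0])

# Aggregate the successive value estimates (illustrative choice: mean).
q_combined = np.mean([q_main, q_sub1, q_sub2], axis=0)
action, probs = softmax_behavior_action(q_combined, temperature=0.5)
```

Because every action keeps nonzero probability, the behavior policy can still visit actions whose value estimates later rise, which is how such a policy lets the learned values adapt when the optimum shifts.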
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
value decomposition
shifting optima
suboptimal policies
adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reinforcement learning
value decomposition
suboptimal actions
adaptive exploration
Successive Sub-value Q-learning
Yonghyeon Jo
Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology (UNIST), Ulsan, South Korea 44919
Sunwoo Lee
Graduate School of AI, UNIST
Reinforcement Learning
Seungyul Han
Assistant Professor, Graduate School of AI, UNIST
Reinforcement Learning, Machine Learning, Intelligent Control, Signal Processing