Learning Equilibria in Matching Games with Bandit Feedback

📅 2025-06-04
🤖 AI Summary
This paper studies the problem of jointly learning equilibria in two-sided matching markets with bandit feedback: matched agents on both sides adaptively select strategies in zero-sum games whose payoff matrices are initially unknown, and the goal is to learn a stable matching together with the corresponding strategy equilibrium. We introduce “matching instability” as a novel regret metric for equilibrium learning and define a new solution concept: matching equilibrium. We propose the first UCB-type joint learning algorithm with sublinear, instance-independent regret guarantees. Theoretically, we prove that the algorithm achieves an $O(\sqrt{T})$ regret upper bound over time horizon $T$ and converges almost surely to a matching equilibrium. Our work establishes the first provably convergent online learning framework for equilibrium learning by a centralized procedure in matching markets where agents choose actions adaptively.

📝 Abstract
We investigate the problem of learning an equilibrium in a generalized two-sided matching market, where agents can adaptively choose their actions based on their assigned matches. Specifically, we consider a setting in which matched agents engage in a zero-sum game with initially unknown payoff matrices, and we explore whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of matching equilibrium, where a pair consisting of a matching $\mathfrak{m}$ and a set of agent strategies $X$ forms an equilibrium if no agent has the incentive to deviate from $(\mathfrak{m}, X)$. To measure the deviation of a given pair $(\mathfrak{m}, X)$ from the equilibrium pair $(\mathfrak{m}^\star, X^\star)$, we introduce matching instability that can serve as a regret measure for the corresponding learning problem. We then propose a UCB algorithm in which agents form preferences and select actions based on optimistic estimates of the game payoffs, and prove that it achieves sublinear, instance-independent regret over a time horizon $T$.
Problem

Research questions and friction points this paper is trying to address.

Learning equilibria in two-sided matching markets with adaptive agents
Centralized learning of equilibrium from bandit feedback in zero-sum games
Measuring deviation from equilibrium using matching instability as regret
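
As a rough illustration of how deviation from a stable matching can be checked, the Python sketch below lists blocking pairs, that is, pairs of agents who would both strictly gain by leaving their current partners for each other. It is not the paper's matching instability measure, whose exact definition is not reproduced on this page. The function name `blocking_pairs`, the utility tables, and the assumption that each agent's utility is the value of the zero-sum game with its partner (with the column side receiving the negated value) are all hypothetical choices made for the example.

```python
def blocking_pairs(matching, row_utility, col_utility):
    """Return pairs (a, b) who both strictly prefer each other to their partners.

    matching:          dict mapping each row-side agent to its column-side partner
                       (assumed one-to-one and complete).
    row_utility[a][b]: utility of row-side agent a when matched with b.
    col_utility[b][a]: utility of column-side agent b when matched with a.
    """
    partner_of_col = {b: a for a, b in matching.items()}
    pairs = []
    for a, current_b in matching.items():
        for b, current_a in partner_of_col.items():
            if b == current_b:
                continue  # a and b are already matched to each other
            a_gains = row_utility[a][b] > row_utility[a][current_b]
            b_gains = col_utility[b][a] > col_utility[b][current_a]
            if a_gains and b_gains:
                pairs.append((a, b))
    return pairs


# Hypothetical game values: value[a][b] is what a wins (and b loses) on average.
value = {"a1": {"b1": 0.9, "b2": 0.1}, "a2": {"b1": 0.8, "b2": 0.2}}
row_utility = value
# Zero-sum assumption: the column-side agent receives the negated value.
col_utility = {b: {a: -value[a][b] for a in value} for b in ("b1", "b2")}

unstable = blocking_pairs({"a1": "b1", "a2": "b2"}, row_utility, col_utility)
stable = blocking_pairs({"a1": "b2", "a2": "b1"}, row_utility, col_utility)
```

Here `unstable` contains the single pair `("a2", "b1")`, while `stable` is empty. A matching with no blocking pairs covers only the matching half of an equilibrium pair $(\mathfrak{m}, X)$; the strategies $X$ would have to be checked separately.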
Innovation

Methods, ideas, or system contributions that make the work stand out.

UCB algorithm for optimistic payoff estimates
Matching instability as regret measure
Learning equilibria from bandit feedback
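
The optimistic payoff estimates listed above can be sketched with a generic UCB1-style confidence bonus. This is a minimal Python sketch under stated assumptions, not the paper's algorithm: payoffs are assumed to lie in [0, 1], the bonus `sqrt(2 log t / n)` is the textbook UCB1 form rather than whatever bonus the paper analyzes, and the class and method names are hypothetical.

```python
import math


class OptimisticPayoffEstimate:
    """UCB-style estimate of one unknown payoff matrix (entries assumed in [0, 1])."""

    def __init__(self, n_rows, n_cols):
        self.counts = [[0] * n_cols for _ in range(n_rows)]
        self.sums = [[0.0] * n_cols for _ in range(n_rows)]

    def update(self, i, j, payoff):
        """Record one noisy bandit observation of entry (i, j)."""
        self.counts[i][j] += 1
        self.sums[i][j] += payoff

    def ucb(self, i, j, t):
        """Empirical mean plus an exploration bonus, clipped to the payoff range."""
        n = self.counts[i][j]
        if n == 0:
            return 1.0  # unobserved entries get the most optimistic value
        mean = self.sums[i][j] / n
        bonus = math.sqrt(2.0 * math.log(max(t, 2)) / n)
        return min(1.0, mean + bonus)

    def ucb_matrix(self, t):
        """Optimistic estimate of the whole payoff matrix at round t."""
        return [[self.ucb(i, j, t) for j in range(len(row))]
                for i, row in enumerate(self.counts)]


estimate = OptimisticPayoffEstimate(2, 2)
for _ in range(100):
    estimate.update(0, 0, 0.5)  # entry (0, 0) observed 100 times with payoff 0.5
optimistic = estimate.ucb_matrix(t=100)
```

After these updates, `optimistic[0][0]` lies strictly between the empirical mean 0.5 and 1.0, and the three unobserved entries stay at 1.0. A learner in this style would feed such an optimistic matrix into whatever step forms preferences and selects actions, so that rarely observed action pairs remain attractive until they have been tried.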