Learn to Match: Two-Sided Matching with Temporally Extended Feedback

📅 2026-06-04

🤖 AI Summary

This study addresses a limitation of traditional two-sided matching models: they assume fixed preferences and immediate feedback, and so fail to capture how preferences are revealed gradually through interaction in real-world settings. The authors formulate matching as a partially observable Markov game with temporally extended feedback, incorporating costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution of matches. They define an “information-friction loss” that measures the welfare gap caused by incomplete revelation of latent preferences. They instantiate the framework in Learn2Match, a multi-agent reinforcement learning (MARL) benchmark, and compare independent PPO against the bandit-style CA-ETC baseline. In their experiments, independent PPO achieves higher cumulative social welfare and lower cumulative regret than CA-ETC, yet incurs higher information-friction loss, indicating that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods.

📝 Abstract

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms.
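The abstract describes the information-friction loss only as the welfare gap caused by incomplete revelation of latent preferences, and the exact definition is not given here. As a rough illustration under that reading, the sketch below compares the welfare of the best matching under true pair values with the true welfare of the matching chosen from noisy beliefs. Everything in it is an assumption made for illustration: the function names, the `true_value` and `believed_value` matrices, the toy numbers, and the brute-force search, which stands in for whatever matching procedure Learn2Match actually uses.

```python
from itertools import permutations


def welfare(matching, value):
    """Total value of a matching; matching[i] is the firm assigned to worker i."""
    return sum(value[i][j] for i, j in enumerate(matching))


def best_matching(value):
    """Brute-force welfare-maximizing one-to-one matching (tiny markets only)."""
    n = len(value)
    return max(permutations(range(n)), key=lambda m: welfare(m, value))


def information_friction_loss(true_value, believed_value):
    """Welfare gap between full-information and belief-based matchings.

    Both matchings are scored with the TRUE values, so the gap reflects
    only what was lost by acting on incompletely revealed preferences.
    """
    full_info = best_matching(true_value)
    partial_info = best_matching(believed_value)
    return welfare(full_info, true_value) - welfare(partial_info, true_value)


# Hypothetical 3x3 market: rows are workers, columns are firms.
true_value = [
    [5.0, 1.0, 0.0],
    [2.0, 4.0, 1.0],
    [0.0, 2.0, 3.0],
]
# Noisy beliefs that swap the apparent best partners of workers 0 and 1.
believed_value = [
    [1.0, 5.0, 0.0],
    [4.0, 2.0, 1.0],
    [0.0, 2.0, 3.0],
]

loss = information_friction_loss(true_value, believed_value)
print(loss)
```

With accurate beliefs the loss is zero by construction, so under this reading a nonzero value isolates the cost of incomplete information rather than the cost of a poor matching rule.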

Problem

Research questions and friction points this paper is trying to address.

two-sided matching

temporally extended feedback

dynamic matching markets

latent preferences

information revelation

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporally extended feedback

two-sided matching

multi-agent reinforcement learning