Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of integrating heterogeneous utility signals—such as uncertainty, rarity, and diversity—in data subset selection. We propose the first data selection framework grounded in the Logarithmic Market Scoring Rule (LMSR) of prediction markets, which explicitly models and jointly optimizes multi-criteria utility signals via a unified cost function. Key innovations include token-level dynamic budget allocation (ρ = p/ℓ^γ), topic-wise normalization, and a lightweight diversity head—collectively ensuring high interpretability, coverage, and stability. Crucially, this is the first application of LMSR to data selection, enabling transparent, maximum-entropy signal fusion. Experiments show that the method matches strong single-signal baselines on GSM8K using less than 0.1 GPU-hour of selection overhead. On AGNews at 5%–25% sampling rates, it significantly improves subset balance and robustness, mitigating the biases inherent in random sampling.
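The summary's "unified cost function" is the standard LMSR cost $C(q) = b \log \sum_j \exp(q_j/b)$, whose instantaneous prices form a softmax over accumulated shares. A minimal sketch of this fusion step, with the signal names and share amounts purely illustrative (the paper's exact trading protocol is not reproduced here):

```python
import numpy as np

def lmsr_prices(shares, b=1.0):
    """LMSR instantaneous prices from the cost function
    C(q) = b * log(sum_j exp(q_j / b)): a softmax over shares.
    The liquidity parameter b is the single knob controlling how
    concentrated the resulting price distribution is (larger b
    spreads mass more evenly across examples)."""
    z = shares / b
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Signals (e.g. uncertainty, rarity) act as traders: each buys shares
# in the examples it finds useful, and LMSR fuses them into one price.
uncertainty = np.array([0.9, 0.1, 0.5])
rarity      = np.array([0.2, 0.8, 0.5])
prices = lmsr_prices(uncertainty + rarity, b=0.5)
```

Because prices are an exponential of summed shares, the aggregation is exactly the maximum-entropy exponential weighting the abstract refers to, with `b` trading off sharpness against coverage.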

📝 Abstract
Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market (LMSR): signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule $\rho = p/\ell^{\gamma}$, with $\gamma$ exposing an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget) the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring $<0.1$ GPU-hour of selection overhead; on AGNews at 5–25% kept rates the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
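The price-per-token rule $\rho = p/\ell^{\gamma}$ turns selection under a token budget into a greedy value-per-token ranking. A sketch under that reading (the paper may use a different tie-breaking or packing scheme; the numbers below are illustrative):

```python
import numpy as np

def select_under_budget(prices, lengths, budget, gamma=1.0):
    """Greedy selection by price-per-token rho = p / l**gamma.
    gamma = 1 charges an example its full token length; gamma < 1
    discounts length, making the length bias explicit and tunable."""
    rho = prices / np.power(lengths, gamma)
    order = np.argsort(-rho)           # highest value-per-token first
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(int(i))
            used += lengths[i]
    return chosen, used

# A long, highly-priced example can lose to two short, cheaper ones.
chosen, used = select_under_budget(
    prices=np.array([0.5, 0.3, 0.2]),
    lengths=np.array([400, 100, 100]),
    budget=250,
)
```

Here the 400-token example has the highest raw price but the lowest $\rho$, so the two short examples fill the budget first, which is the length bias $\gamma$ makes visible.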
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal training subsets by pricing heterogeneous utility signals
Aggregating multi-criteria example utility through market mechanisms
Unifying data curation under fixed compute constraints for reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Market-based pricing via prediction market for data selection
Single liquidity parameter controls selection concentration
Token budget handling with price-per-token length bias
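The abstract also credits topic-wise normalization with stabilizing calibration. One plausible reading, sketched here as an assumption (the paper's exact normalizer is not shown), is a per-topic z-score of each raw signal before it trades in the market:

```python
import numpy as np

def topicwise_normalize(signal, topics, eps=1e-8):
    """Z-score a raw utility signal within each topic cluster, so no
    topic dominates the market purely through differences in signal
    scale or spread."""
    out = np.empty_like(signal, dtype=float)
    for t in np.unique(topics):
        m = topics == t
        mu, sd = signal[m].mean(), signal[m].std()
        out[m] = (signal[m] - mu) / (sd + eps)
    return out

# Two topics on very different raw scales end up comparable.
norm = topicwise_normalize(
    signal=np.array([1.0, 3.0, 10.0, 30.0]),
    topics=np.array([0, 0, 1, 1]),
)
```

After normalization both topics contribute shares on the same scale, so the exponential LMSR aggregation cannot be swamped by whichever topic happens to have the largest raw signal values.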
Ashish Jha
Skolkovo Institute of Science and Technology
Valentin Leplat
Innopolis University
Matrix/Tensor factorization · Numerical linear algebra · Convex and Non-Convex Numerical Optimization
Anh Huy Phan
Skolkovo Institute of Science and Technology