Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility

📅 2025-10-02

🤖 AI Summary

This work addresses the challenge of integrating heterogeneous utility signals, such as uncertainty, rarity, and diversity, in data subset selection. We propose the first data selection framework grounded in a Logarithmic Market Scoring Rule (LMSR) prediction market, which aggregates multi-criteria utility signals through a single cost function rather than ad hoc weights and yields a transparent, maximum-entropy fusion of signals. Key components are an explicit token-budget rule based on price per token (ρ = p/ℓ^γ), topic-wise normalization, and a lightweight diversity head, which together target interpretability, coverage, and stability. On GSM8K, the method performs on par with strong single-signal baselines while adding less than 0.1 GPU-hour of selection overhead. On AGNews, with 5%–25% of the data kept, it delivers competitive accuracy with improved subset balance and stability.

📝 Abstract

Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market (LMSR): signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule $\rho = p/\ell^{\gamma}$, with $\gamma$ exposing an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget) the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring $<0.1$ GPU-hr selection overhead; on AGNews with 5–25% of the data kept, the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
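
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the z-scoring of each signal, the summing of signals into a single share vector q, and the greedy budget fill are all assumptions made for illustration, and topic-wise normalization and the diversity head are omitted. Only the LMSR softmax prices and the price-per-token rule ρ = p/ℓ^γ come from the text.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Standardize one signal so signals on different scales are comparable (assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + 1e-8) for v in values]


def lmsr_prices(q, b):
    """Standard LMSR prices: a softmax of outstanding shares q scaled by liquidity b."""
    m = max(q)  # subtract the max for numerical stability
    weights = [math.exp((x - m) / b) for x in q]
    total = sum(weights)
    return [w / total for w in weights]


def effective_sample_size(p):
    """Inverse sum of squared prices: equals n for uniform p, 1 for a point mass."""
    return 1.0 / sum(x * x for x in p)


def select(signals, lengths, budget, b=1.0, gamma=1.0):
    """Greedy selection by price per token under a token budget (hypothetical helper).

    signals: dict mapping signal name -> list of per-example scores
    lengths: positive token counts per example
    """
    columns = [zscore(col) for col in signals.values()]
    # Assumption: each signal "trades" its z-score, so shares are the per-example sum.
    q = [sum(vals) for vals in zip(*columns)]
    p = lmsr_prices(q, b)
    # Price-per-token rule from the abstract: rho = p / length**gamma.
    rho = [pi / (n ** gamma) for pi, n in zip(p, lengths)]
    order = sorted(range(len(p)), key=lambda i: rho[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:  # skip examples that would exceed the budget
            chosen.append(i)
            used += lengths[i]
    return chosen, p
```

In this sketch, lowering b sharpens the prices toward the top-scoring examples, while raising it flattens them toward uniform, which shows up directly as a higher effective sample size. Setting gamma to 0 ignores length entirely, and larger values of gamma favor shorter examples.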

Problem

Research questions and friction points this paper is trying to address.

Selecting optimal training subsets by pricing heterogeneous utility signals

Aggregating multi-criteria example utility through market mechanisms

Unifying data curation under fixed compute constraints for reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Market-based pricing via prediction market for data selection

Single liquidity parameter controls selection concentration

Token budget handling with price-per-token length bias
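
For reference, the role of the liquidity parameter can be seen from the standard LMSR formulas. The block below states the textbook cost function, its prices, and the entropy-regularized form that underlies the maximum-entropy reading. Treating q_i as the aggregate position that the signal traders take on example i is an assumption made here; the page does not spell out that mapping. The last line restates the price-per-token rule given in the abstract.

```latex
% Standard LMSR with liquidity b > 0 over examples i = 1..n
C(q) = b \log \sum_{i=1}^{n} \exp\!\left(\frac{q_i}{b}\right),
\qquad
p_i = \frac{\partial C}{\partial q_i}
    = \frac{\exp(q_i / b)}{\sum_{j=1}^{n} \exp(q_j / b)}

% Entropy-regularized (maximum-entropy) form; \Delta_n is the probability simplex
C(q) = \max_{p \in \Delta_n} \Big\{ \sum_{i=1}^{n} p_i q_i + b\, H(p) \Big\},
\qquad
H(p) = -\sum_{i=1}^{n} p_i \log p_i

% Price-per-token rule from the abstract, with \ell_i the token length of example i
\rho_i = \frac{p_i}{\ell_i^{\gamma}}
```

As b approaches 0 the prices concentrate on the examples with the largest q_i, and as b grows they approach the uniform distribution, which is the sense in which a single parameter controls selection concentration.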
