🤖 AI Summary
This work addresses the challenge of integrating heterogeneous utility signals, such as uncertainty, rarity, and diversity, in data subset selection. We propose the first data selection framework grounded in a Logarithmic Market Scoring Rule (LMSR) prediction market, which models multiple utility signals explicitly and aggregates them through a single cost function. Key components include an explicit token-budget rule that ranks examples by price per token (ρ = p/ℓ^γ), topic-wise normalization, and a lightweight diversity head, which together target interpretability, coverage, and stability. Applying LMSR to data selection yields a transparent, maximum-entropy fusion of signals. On GSM8K, the method performs on par with strong single-signal baselines while adding less than 0.1 GPU-hour of selection overhead. On AGNews at keep rates of 5% to 25%, it delivers competitive accuracy with improved subset balance and stability, mitigating biases inherent in random sampling.
📝 Abstract
Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and are typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market (LMSR): signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule $\rho = p/\ell^{\gamma}$, where $p$ is an example's price, $\ell$ is its token length, and $\gamma$ exposes an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget), the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring $<\!0.1$ GPU-hr of selection overhead; on AGNews at keep rates of 5-25%, the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
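To make the pricing and budget rules concrete, the following is a minimal sketch, not the authors' implementation. It uses the standard LMSR price formula $p_i = \exp(q_i/b) / \sum_j \exp(q_j/b)$ with liquidity $b$, followed by the price-per-token rule $\rho = p/\ell^{\gamma}$. The abstract does not say how signals are mapped to market positions, so the weighted sum used for `q` is an assumption, as are the function names `lmsr_prices`, `effective_sample_size`, and `select_under_token_budget`, the inverse-sum-of-squares definition of effective sample size, and the greedy fill. Topic-wise normalization and the diversity head are omitted.

```python
import numpy as np


def lmsr_prices(signals, weights=None, b=1.0):
    """Turn per-example signal scores into LMSR prices.

    signals: array of shape (n_examples, n_signals).
    ASSUMPTION: each signal "buys" shares in proportion to its score, so the
    outstanding shares q are a weighted sum of the signal columns.
    """
    signals = np.asarray(signals, dtype=float)
    if weights is None:
        weights = np.ones(signals.shape[1])
    q = signals @ np.asarray(weights, dtype=float)
    z = q / b
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()  # standard LMSR price: softmax of q / b


def effective_sample_size(prices):
    """Inverse sum of squared prices (one common definition, assumed here)."""
    prices = np.asarray(prices, dtype=float)
    return 1.0 / np.sum(prices ** 2)


def select_under_token_budget(prices, lengths, budget, gamma=1.0):
    """Greedy selection by price per token, rho = p / length**gamma.

    ASSUMPTION: examples are taken in descending rho order and skipped
    when they would exceed the remaining token budget.
    """
    prices = np.asarray(prices, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    rho = prices / lengths ** gamma
    chosen, used = [], 0
    for i in np.argsort(-rho):
        if used + lengths[i] <= budget:
            chosen.append(int(i))
            used += int(lengths[i])
    return chosen, used


# Toy data: four examples, two signals (say, uncertainty and rarity).
toy_signals = [[2.0, 1.0], [0.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
toy_lengths = [100, 20, 50, 30]

sharp = lmsr_prices(toy_signals, b=1.0)   # small b: concentrated prices
flat = lmsr_prices(toy_signals, b=10.0)   # large b: near-uniform prices
print("ESS at b=1: ", effective_sample_size(sharp))
print("ESS at b=10:", effective_sample_size(flat))
print(select_under_token_budget(sharp, toy_lengths, budget=100, gamma=1.0))
print(select_under_token_budget(sharp, toy_lengths, budget=100, gamma=2.0))
```

In this toy setting, raising `b` spreads the prices out and increases the effective sample size, while raising `gamma` shifts the selection toward shorter examples, which is the length bias that $\gamma$ is meant to expose.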