🤖 AI Summary
Existing neural scaling-law fits are point estimates without uncertainty quantification, which hinders risk-aware decision-making during extrapolation. Method: We propose a Bayesian meta-learning framework based on Prior-data Fitted Networks (PFNs) for scaling-law modeling. We design a principled, sampleable prior distribution over scaling functions, enabling the PFN to meta-learn probabilistic extrapolation. Contribution/Results: Compared with conventional point estimates and existing Bayesian methods, our framework achieves improved extrapolation accuracy and predictive calibration on real-world scaling data. It performs especially well in low-data regimes, particularly Bayesian active learning, demonstrating robustness where observations are scarce. By providing well-calibrated uncertainty estimates, our method supports trustworthy deployment of scaling laws in safety-critical applications.
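For context, one standard member of the power-law family discussed below is the saturating power law. The notation here is illustrative and drawn from the broader scaling-law literature, not necessarily the exact functional family the paper uses:

```latex
% Saturating power law, a common form in the scaling-law literature.
% Illustrative notation: L(x) is the predicted loss at scale x (compute,
% data, or parameters), E the irreducible loss, a an amplitude, and
% b > 0 the decay exponent. A point-estimate fit returns a single
% (E, a, b); the Bayesian treatment instead yields a posterior
% distribution over such functions.
L(x) = E + a \, x^{-b}
```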
📝 Abstract
Scaling has been a major driver of recent advances in deep learning. Numerous empirical studies have found that neural scaling laws often follow a power law and have proposed several variants of power-law functions to predict scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world decision-making problems such as determining the expected performance improvement from investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both existing point-estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
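To make the prior construction concrete, below is a minimal sketch of the kind of sampler the abstract describes: each draw yields one synthetic function resembling a neural scaling law, and context/target splits from many such draws form meta-training tasks for the PFN. All function names, parameter ranges, and the specific saturating power-law form are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def sample_scaling_law(rng: np.random.Generator):
    """Draw one synthetic scaling-law function from a hand-designed prior.

    Illustrative only: the ranges and the saturating power-law form
    L(x) = E + a * x**(-b) are assumptions, not the paper's exact prior.
    """
    E = rng.uniform(0.0, 1.0)                # irreducible loss floor
    a = np.exp(rng.uniform(0.0, 3.0))        # amplitude
    b = rng.uniform(0.1, 1.0)                # decay exponent
    sigma = np.exp(rng.uniform(-5.0, -2.0))  # observation-noise scale
    return lambda x: E + a * x ** (-b) + rng.normal(0.0, sigma, size=np.shape(x))

def sample_pfn_task(rng: np.random.Generator, n_context=8, n_target=32):
    """Draw one (context, target) task for PFN meta-training.

    Small-scale points form the observed context; larger scales are the
    extrapolation targets the PFN must place a predictive distribution over.
    """
    f = sample_scaling_law(rng)
    x_ctx = np.sort(10 ** rng.uniform(0.0, 3.0, size=n_context))  # observed, small scales
    x_tgt = np.sort(10 ** rng.uniform(3.0, 6.0, size=n_target))   # extrapolation region
    return (x_ctx, f(x_ctx)), (x_tgt, f(x_tgt))

rng = np.random.default_rng(0)
(ctx_x, ctx_y), (tgt_x, tgt_y) = sample_pfn_task(rng)
```

Training a PFN on many such tasks amortizes Bayesian inference: at test time, conditioning on a handful of observed (scale, loss) points yields an approximate posterior predictive distribution over losses at unseen, larger scales, without any per-curve fitting.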