🤖 AI Summary
This study addresses the largely unexamined scaling behavior of small models with fewer than 20 million parameters, the regime where TinyML and edge AI operate. The authors systematically train 90 width-varying, depth-fixed ScaleCNN and MobileNetV2 models (22K to 19.8M parameters) on CIFAR-100 and run controlled experiments analyzing error-rate scaling, error structure, class specialization, and calibration behavior. They show, for the first time, that within this extremely small parameter regime, models follow a steeper error-rate power law than large language models (α ≈ 0.156/0.106), exhibit low error-set overlap across scales (as low as 0.35), and tend to specialize in easier classes. Notably, the smallest models achieve the best calibration (ECE = 0.013), highlighting the limitations of aggregate accuracy metrics for edge deployment scenarios.
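The two structural diagnostics quoted above can be made concrete with a short sketch. This is an illustrative implementation under standard definitions, not the paper's code: Jaccard overlap is the intersection-over-union of two models' sets of misclassified example indices, and ECE is the confidence-weighted gap between per-bin accuracy and per-bin mean confidence. All inputs below are synthetic toy data.

```python
import numpy as np

def jaccard_error_overlap(errs_a, errs_b):
    """Jaccard index of two sets of misclassified example indices."""
    a, b = set(errs_a), set(errs_b)
    return len(a & b) / len(a | b)

def ece(confidences, correct, n_bins=15):
    """Expected calibration error: sum over confidence bins of
    (bin weight) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap  # weight by fraction of samples in bin
    return total

# Toy example: a small and a large model misclassify overlapping
# but largely different inputs, giving a low Jaccard overlap.
small_errs = [1, 2, 3, 5, 8]
large_errs = [2, 3, 13]
print(jaccard_error_overlap(small_errs, large_errs))  # 2/6 ≈ 0.333
```

A low overlap, as in this toy case, is exactly the paper's point: shrinking a model changes *which* inputs fail, not just how many.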
📝 Abstract
Neural scaling laws describe how model performance improves as a power law with size, but existing work focuses on models above 100M parameters. The sub-20M regime -- where TinyML and edge AI operate -- remains unexamined. We train 90 models (22K--19.8M parameters) across two architectures (ScaleCNN, a plain ConvNet, and MobileNetV2) on CIFAR-100, varying width while holding depth and training fixed. Both follow approximate power laws in error rate: $\alpha = 0.156 \pm 0.002$ (ScaleCNN) and $\alpha = 0.106 \pm 0.001$ (MobileNetV2) across five seeds. Since prior work fit cross-entropy loss rather than error rate, direct exponent comparison is approximate; with that caveat, these are 1.4--2x steeper than $\alpha \approx 0.076$ for large language models. The power law does not hold uniformly: local exponents decay with scale, and MobileNetV2 saturates at 19.8M parameters ($\alpha_{\mathrm{local}} = 0.006$). Error structure also changes. Jaccard overlap between error sets of the smallest and largest ScaleCNN is only 0.35 (25 seed pairs, $\pm 0.004$) -- compression changes which inputs are misclassified, not merely how many. Small models concentrate capacity on easy classes (Gini: 0.26 at 22K vs. 0.09 at 4.7M) while abandoning the hardest (bottom-5 accuracy: 10% vs. 53%). Counter to expectation, the smallest models are best calibrated (ECE = 0.013 vs. peak 0.110 at mid-size). Aggregate accuracy is therefore misleading for edge deployment; validation must happen at the target model size.
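The fitting procedure implied by the abstract -- a power law $\epsilon(N) = a N^{-\alpha}$ in error rate, plus local exponents that reveal where the law flattens -- reduces to linear regression in log-log space. The sketch below illustrates this under that standard assumption; the parameter counts and error rates are synthetic (generated to follow $\alpha = 0.156$ exactly), not the paper's measurements.

```python
import numpy as np

def fit_power_law(params, errors):
    """Fit errors ≈ a * params**(-alpha) by least squares on
    log(errors) = log(a) - alpha * log(params); return (alpha, a)."""
    slope, intercept = np.polyfit(np.log(params), np.log(errors), 1)
    return -slope, np.exp(intercept)

def local_exponents(params, errors):
    """Pairwise exponents between consecutive model sizes; a sequence
    decaying toward zero indicates the power law saturating with scale."""
    return -np.diff(np.log(errors)) / np.diff(np.log(params))

# Synthetic model sizes spanning roughly the paper's 22K--19.8M range,
# with error rates constructed to follow alpha = 0.156 exactly.
N = np.array([22e3, 90e3, 360e3, 1.2e6, 4.7e6, 19.8e6])
eps = 5.0 * N ** -0.156

alpha, a = fit_power_law(N, eps)
print(f"alpha = {alpha:.3f}")  # recovers 0.156 on this synthetic data
```

On real runs the interesting signal is in `local_exponents`: a single global $\alpha$ can look clean even when, as the abstract notes, the exponent decays from the small end of the range toward near-zero at saturation.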