🤖 AI Summary
Conventional model scaling relies on parameter growth, leading to substantial increases in GPU memory consumption and inference latency.
Method: We propose ParScale, a parallel scaling paradigm that introduces P-way learnable transformations and dynamic aggregation during both training and inference—enabling computational parallelization without parameter inflation.
Contribution: We establish, for the first time, a theoretical scaling law linking parallel scale to performance, showing that $P$-way parallelism is roughly equivalent to growing the parameter count by $O(\log P)$. ParScale can reuse an off-the-shelf pretrained model without architectural modification, requiring only lightweight post-training for adaptation. Experiments demonstrate that, for the same performance improvement, ParScale incurs up to 22× less memory increase and 6× less latency increase than conventional parameter scaling—significantly lowering both training and deployment costs.
📝 Abstract
It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective on the role of computation in machine learning.
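To make the three-step pipeline concrete, here is a minimal toy sketch of the ParScale idea: apply $P$ diverse learnable transformations to one input, run a single shared "model" on all $P$ streams, and aggregate the outputs with input-dependent weights. This is an illustrative assumption, not the paper's implementation: the shared model is a tiny fixed linear map standing in for a pretrained LLM, the per-stream transformations are simple affine maps (the paper uses other learnable forms), and the aggregation head `w_agg` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 8, 4  # hidden size and number of parallel streams (toy values)

# Stand-in for the shared pretrained model: one fixed nonlinear map,
# reused by every stream -- no parameter growth with P.
W_model = rng.normal(size=(D, D))

def model(x):
    """Shared forward pass applied to each parallel stream."""
    return np.tanh(x @ W_model)

# P learnable input transformations (here: per-stream affine maps).
A = np.eye(D) + 0.1 * rng.normal(size=(P, D, D))
b = 0.1 * rng.normal(size=(P, D))

# Hypothetical aggregation head: scores each stream's output.
w_agg = rng.normal(size=(P, D))

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def parscale_forward(x):
    # 1) Apply P diverse transformations to the same input.
    streams = np.stack([x @ A[p] + b[p] for p in range(P)])    # (P, D)
    # 2) Execute the shared model on all streams (parallelizable).
    outs = model(streams)                                       # (P, D)
    # 3) Dynamically aggregate: input-dependent weights over streams.
    alpha = softmax(np.array([outs[p] @ w_agg[p] for p in range(P)]))
    return alpha @ outs, alpha                                  # (D,), (P,)

x = rng.normal(size=D)
y, alpha = parscale_forward(x)
```

Note the design point the paper emphasizes: all $P$ forward passes share `W_model`, so extra capability comes from extra parallel computation rather than extra parameters; only the lightweight transforms `A`, `b` and the aggregation weights scale with $P$.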