🤖 AI Summary
To address the prohibitive computational and memory costs of hyperparameter optimization (HPO) for large models, this work introduces *layer freezing count* as a novel, scalable, and memory-efficient fidelity dimension for multi-fidelity HPO (MF-HPO). Unlike conventional fidelity proxies—such as training iterations or data subsampling—which degrade severely under low-budget regimes, the authors systematically demonstrate, for the first time, a strong rank correlation between hyperparameter performance rankings obtained with varying numbers of frozen layers and those from full training, in both ResNet and Transformer architectures. This enables reliable early-stage performance estimation while drastically reducing GPU memory consumption and training overhead. Moreover, layer freezing supports joint optimization with hardware resources (e.g., GPU count), facilitating hardware-aware HPO. The approach establishes a new paradigm for resource-efficient, hardware-informed HPO and provides both theoretical grounding and practical methodology for scalable hyperparameter tuning of large-scale models.
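To see where the memory savings come from, consider that frozen layers need neither gradients nor optimizer state. The following is a minimal sketch with a made-up toy network (layer sizes and the 6-layer structure are illustrative assumptions, not figures from the paper), counting trainable parameters and Adam optimizer-state floats as the number of frozen layers grows:

```python
def trainable_stats(layer_params, n_frozen):
    """Return (trainable params, optimizer-state floats) when the first
    n_frozen layers are frozen. Adam keeps 2 extra floats per trainable
    parameter (first and second moments), so frozen layers save both
    gradient storage and optimizer state."""
    trainable = sum(layer_params[n_frozen:])
    optimizer_state = 2 * trainable  # Adam: m and v per parameter
    return trainable, optimizer_state

# Hypothetical 6-layer network: parameter counts per layer (illustrative).
layers = [1_000, 4_000, 16_000, 16_000, 4_000, 1_000]

for k in range(0, len(layers), 2):
    params, opt = trainable_stats(layers, k)
    print(f"freeze first {k} layers -> {params:>6} trainable params, "
          f"{opt:>6} optimizer-state floats")
```

Freezing early layers also skips their backward pass, which is where the compute savings in the summary come from; the memory arithmetic above captures only the gradient and optimizer-state side.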
📝 Abstract
As model sizes grow, finding efficient and cost-effective hyperparameter optimization (HPO) methods becomes increasingly crucial for deep learning pipelines. While multi-fidelity HPO (MF-HPO) trades off the computational resources required for DL training against lower-fidelity estimations, existing fidelity sources often fail under tight compute and memory constraints. We propose a novel fidelity source: the number of layers that are trained or frozen during training. For deep networks, this approach offers significant compute and memory savings while preserving rank correlations between hyperparameters at low fidelities compared to full model training. We demonstrate this in our empirical evaluation across ResNets and Transformers, and additionally analyze the utility of frozen layers as a fidelity for enabling GPU resources to serve as a fidelity in HPO, and for combining with other fidelity sources in MF-HPO. This contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms navigating joint fidelity spaces.
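The key quantity behind "preserving rank correlations" is Spearman's rho between the hyperparameter-configuration rankings at low fidelity (many frozen layers) and at full fidelity. A minimal self-contained sketch, with made-up validation scores for five hypothetical configurations (a real evaluation would obtain these by training):

```python
def ranks(xs):
    """Rank values from best (highest score) = 1; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Illustrative validation scores of 5 hyperparameter configs:
full_fidelity = [0.91, 0.84, 0.88, 0.79, 0.93]  # all layers trained
low_fidelity  = [0.72, 0.69, 0.61, 0.58, 0.74]  # most layers frozen

rho = spearman(low_fidelity, full_fidelity)
print(f"Spearman rank correlation: {rho:.2f}")
```

A rho near 1 means the cheap frozen-layer evaluations order configurations almost the same way as full training does, which is exactly what makes them usable as a fidelity for early discarding of poor configurations.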