AI Summary
Large language models typically execute all layers in a fixed, sequential depth, making it difficult to adapt their computational pathways dynamically based on input. This work proposes Programmed Layers (PoLar), a method that employs a lightweight predictor to generate, for each input, an execution program that dynamically skips or loops through pretrained layers, enabling flexible inference without any fine-tuning. PoLar is the first to reveal that dynamic execution programs are inherently present and directly exploitable within pretrained large models, thereby overcoming the limitations of conventional fixed-depth forward propagation. Experiments demonstrate that PoLar significantly improves accuracy on mathematical reasoning benchmarks, often surpassing standard and existing dynamic-depth approaches with fewer active layers, while maintaining robust performance on out-of-distribution data.
Abstract
Large language models (LLMs) perform inference by executing all layers exactly once, in a fixed order and at a fixed depth, with no recurrence. We reveal the wide existence of training-free, flexible, dynamic program-of-layers (PoLar), where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
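To make the idea of a "program of layers" concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify how a program is encoded, so this sketch assumes a program is simply a list of layer indices, where an omitted index means the layer is skipped and a repeated index means it is looped. The names `run_program`, `layers`, `standard`, and `custom` are hypothetical, the layers are toy scalar functions standing in for frozen pretrained blocks, and the prediction network that would choose the program is not modeled.

```python
from typing import Callable, Sequence

# A "layer" here is any function from a hidden state to a hidden state.
# Assumption: real layers would be frozen transformer blocks; scalars keep the sketch runnable.
Layer = Callable[[float], float]


def run_program(layers: Sequence[Layer], program: Sequence[int], x: float) -> float:
    """Apply layers in the order listed in `program`.

    An index absent from `program` is skipped; an index that appears
    more than once is executed repeatedly (looped).
    """
    for idx in program:
        x = layers[idx](x)
    return x


# Toy stand-ins for four pretrained layers (hypothetical, for illustration only).
layers = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h * 0.5,
]

# Standard inference: every layer once, in order.
standard = list(range(len(layers)))  # [0, 1, 2, 3]

# A custom program: skip layer 2 and run layer 1 twice.
custom = [0, 1, 1, 3]

out_standard = run_program(layers, standard, 1.0)
out_custom = run_program(layers, custom, 1.0)
```

In this encoding the standard forward pass is just the special case `program = [0, 1, ..., L-1]`, and a shorter program corresponds to executing fewer layers. In the setting the abstract describes, the program would be produced per input by the lightweight prediction network rather than written by hand.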