🤖 AI Summary
This survey observes that efficiency in large language model training is usually studied through isolated techniques, even though limits on data, memory, and compute interact. It organizes recent work around a constraint-centric view with three coupled axes: data efficiency, memory efficiency, and compute budget awareness. The reviewed evidence indicates that the best data subset depends on both the task objective and the resource budget, and that GPU memory, rather than raw compute, is often the dominant bottleneck in fine-tuning. The survey covers learning-dynamics-, gradient-, and influence-based data selection and pruning, compute-optimal stopping rules, and adaptive inference allocation, and unifies them under the principle of resource-conditioned decision-making for training and deployment with limited resources.
📝 Abstract
Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.
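The stopping rule described in the abstract (halt or reallocate once marginal gains fall below a budget-dependent threshold) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the survey: the `Checkpoint` record, the function names, and in particular the threshold form `base / fraction_of_budget_remaining`, which simply makes the rule stricter as the budget runs out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Checkpoint:
    flops: float  # cumulative FLOPs spent up to this evaluation
    score: float  # validation metric, higher is better


def marginal_gain(prev: Checkpoint, curr: Checkpoint) -> float:
    """Score improvement per FLOP between two consecutive checkpoints."""
    spent = curr.flops - prev.flops
    if spent <= 0:
        raise ValueError("checkpoints must have strictly increasing FLOP counts")
    return (curr.score - prev.score) / spent


def budget_threshold(spent_flops: float, total_flops: float, base: float) -> float:
    """Hypothetical budget-dependent threshold (an illustrative assumption).

    Equals `base` when the full budget remains and grows as the budget is
    consumed, so late-stage compute must justify itself with larger gains.
    """
    frac_remaining = max(total_flops - spent_flops, 0.0) / total_flops
    return base / max(frac_remaining, 1e-12)  # guard against division by zero


def stop_index(
    checkpoints: Sequence[Checkpoint], total_flops: float, base: float
) -> Optional[int]:
    """Return the index of the first checkpoint at which to stop, or None."""
    for i in range(1, len(checkpoints)):
        gain = marginal_gain(checkpoints[i - 1], checkpoints[i])
        if gain < budget_threshold(checkpoints[i].flops, total_flops, base):
            return i
    return None


# Toy run: gains shrink at each evaluation while the budget is drawn down.
history = [
    Checkpoint(flops=1e18, score=0.500),
    Checkpoint(flops=2e18, score=0.600),
    Checkpoint(flops=3e18, score=0.650),
    Checkpoint(flops=4e18, score=0.651),
]
decision = stop_index(history, total_flops=1e19, base=1e-20)
```

In this toy run the last evaluation improves the score by only about 0.001 for 1e18 additional FLOPs, which falls below the threshold, so the remaining budget could be reallocated (for example to inference) instead of further training. The same comparison could in principle be applied to data selection or decoding, with the gain measured per token or per sample rather than per training FLOP.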