AI Summary
This work addresses the challenge of deploying large language models under memory, latency, and hardware-cost constraints, where existing post-training compression methods offer no unified, efficient way to handle algorithm selection, precision allocation, and hardware adaptation. We propose an open-source, hardware-aware automated compression framework that performs end-to-end model compression with a single command. The framework combines automatic model analysis, mixed-precision planning, and staged progressive quantization that moves from individual layers to blocks to the entire model. It treats the first quantized checkpoint as a deployable baseline, so that every subsequent optimization stage refines that same model rather than producing a new one. This approach bridges algorithmic research and production deployment: it reduces resource overhead while preserving model accuracy, and it makes compression strategies more reproducible and practical.
Abstract
Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
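To make the idea of planning mixed-precision assignments under a hardware budget concrete, the following is a minimal sketch of one possible greedy planner. It is not OneComp's actual planner or API: the `Layer` class, the `plan_bits` function, the per-layer `sensitivity` scores, and the diminishing-returns heuristic are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    sensitivity: float  # hypothetical score: higher means more harmed by low precision


def plan_bits(layers, budget_bytes, choices=(2, 3, 4, 8)):
    """Greedy mixed-precision plan (illustrative assumption, not OneComp's method).

    Every layer starts at the lowest bit width. The planner then repeatedly
    upgrades the layer with the best estimated benefit per extra byte until
    no further upgrade fits in the memory budget.
    """
    choices = sorted(choices)
    level = {layer.name: 0 for layer in layers}  # index into `choices`

    def size(layer, idx):
        return layer.num_params * choices[idx] / 8  # weight bytes at this width

    used = sum(size(layer, 0) for layer in layers)
    if used > budget_bytes:
        raise ValueError("budget too small even at the lowest precision")

    while True:
        best = None
        for layer in layers:
            idx = level[layer.name]
            if idx + 1 >= len(choices):
                continue  # already at the highest allowed precision
            extra = size(layer, idx + 1) - size(layer, idx)
            if used + extra > budget_bytes:
                continue  # this upgrade does not fit
            # Assumed heuristic: the benefit of more bits shrinks as width grows.
            score = (layer.sensitivity / choices[idx]) / extra
            if best is None or score > best[0]:
                best = (score, layer, extra)
        if best is None:
            break
        _, layer, extra = best
        level[layer.name] += 1
        used += extra

    return {name: choices[idx] for name, idx in level.items()}, used


# Two hypothetical layers and a 2000-byte weight budget.
layers = [Layer("attn", 1000, 10.0), Layer("mlp", 4000, 1.0)]
plan, used = plan_bits(layers, budget_bytes=2000)
print(plan, used)
```

In this toy setting the small, sensitive layer absorbs the spare budget while the large, robust layer stays at the lowest width. A real planner would derive sensitivities from calibration data and account for which bit widths the target hardware can execute efficiently.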