🤖 AI Summary
To address 6G’s demand for ultra-high compression ratios, and with conventional lossless compression methods having nearly reached their theoretical limits, this paper introduces LMCompress, the first framework to directly leverage large language models (LLMs) for universal lossless compression. It formulates compression as optimal induction over data distributions, approximating the uncomputable Solomonoff induction. Methodologically, LMCompress integrates LLM-driven sequence modeling and probabilistic prediction, context-aware entropy coding, and model distillation with quantization for acceleration. Experiments demonstrate that LMCompress consistently outperforms state-of-the-art (SOTA) methods across diverse modalities: roughly doubling the compression ratios of JPEG-XL (images), FLAC (audio), and H.264 (video), and roughly quadrupling that of bz2 (text). These gains break classical information-theoretic bottlenecks and establish a novel lossless compression paradigm grounded in universal data understanding.
📝 Abstract
Modern data compression methods are slowly reaching their limits after 80 years of research, millions of papers, and a wide range of applications. Yet the extravagant speed requirements of 6G communication pose a major open question that calls for revolutionary new ideas in data compression. We have previously shown that, under reasonable assumptions, all understanding or learning is compression. Large language models (LLMs) understand data better than ever before. Can they help us compress data? LLMs may be seen as approximating the uncomputable Solomonoff induction. Under this new uncomputable paradigm, we present LMCompress. LMCompress shatters all previous lossless compression algorithms, doubling the lossless compression ratios of JPEG-XL for images, FLAC for audio, and H.264 for videos, and quadrupling the compression ratio of bz2 for text. The better a large model understands the data, the better LMCompress compresses.
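The core mechanism described above, a predictive model's next-symbol probabilities driving an entropy coder, can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: `toy_model` is a hypothetical stand-in for an LLM's next-token distribution, and exact fractions replace the bit-level arithmetic coder a real system would use.

```python
from fractions import Fraction

def toy_model(context):
    # Hypothetical stand-in for an LLM: a next-symbol distribution
    # over the alphabet {"a", "b"} that favors repeating the last symbol.
    if context and context[-1] == "a":
        return {"a": Fraction(3, 4), "b": Fraction(1, 4)}
    return {"a": Fraction(1, 4), "b": Fraction(3, 4)}

def encode(seq, model):
    # Arithmetic coding: narrow the interval [low, high) by each
    # symbol's probability mass under the model, given its context.
    low, high, context = Fraction(0), Fraction(1), ""
    for sym in seq:
        probs = model(context)
        width = high - low
        cum = Fraction(0)
        for s in sorted(probs):
            if s == sym:
                low, high = low + width * cum, low + width * (cum + probs[s])
                break
            cum += probs[s]
        context += sym
    return (low + high) / 2  # any number inside [low, high) identifies seq

def decode(code, length, model):
    # Mirror the encoder: find which symbol's sub-interval contains code.
    out, context = [], ""
    low, high = Fraction(0), Fraction(1)
    for _ in range(length):
        probs = model(context)
        width = high - low
        cum = Fraction(0)
        for s in sorted(probs):
            lo2 = low + width * cum
            hi2 = lo2 + width * probs[s]
            if lo2 <= code < hi2:
                out.append(s)
                low, high, context = lo2, hi2, context + s
                break
            cum += probs[s]
    return "".join(out)
```

The better the model predicts the data (i.e., the more probability it assigns to each actual next symbol), the narrower each interval shrinks and the fewer bits the final code needs, which is exactly why stronger understanding yields stronger compression.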