🤖 AI Summary
To address 6G’s demand for ultra-high compression ratios, and with conventional lossless compression methods having nearly reached their theoretical limits, this paper introduces LMCompress, the first framework to directly leverage large language models (LLMs) for universal lossless compression. It formulates compression as optimal induction over data distributions, approximating the uncomputable Solomonoff induction. Methodologically, LMCompress integrates LLM-driven sequence modeling and probabilistic prediction, context-aware entropy coding, and model distillation with quantization for acceleration. Experiments demonstrate that LMCompress consistently outperforms state-of-the-art (SOTA) methods across diverse modalities: roughly doubling the compression ratios of JPEG-XL (images), FLAC (audio), and H.264 (video), and roughly quadrupling that of bz2 (text). These gains break classical information-theoretic bottlenecks and establish a novel lossless compression paradigm grounded in universal data understanding.
📝 Abstract
Modern data compression methods are slowly reaching their limits after 80 years of research, millions of papers, and a wide range of applications. Yet the extravagant speed requirements of 6G communication pose a major open question that calls for revolutionary new ideas in data compression. We have previously shown that, under reasonable assumptions, all understanding or learning is compression. Large language models (LLMs) understand data better than ever before. Can they help us compress data? LLMs may be seen as approximating the uncomputable Solomonoff induction. Under this new uncomputable paradigm, we present LMCompress. LMCompress shatters all previous lossless compression algorithms, doubling the lossless compression ratios of JPEG-XL for images, FLAC for audio, and H.264 for videos, and quadrupling the compression ratio of bz2 for text. The better a large model understands the data, the better LMCompress compresses.
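The core mechanism described above, a predictive model's next-symbol probabilities driving an entropy coder, can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: `toy_model` is a hypothetical stand-in for an LLM's next-token distribution, and exact fractions replace the bit-level arithmetic coder a real system would use.

```python
from fractions import Fraction

def toy_model(context):
    # Hypothetical stand-in for an LLM: a next-symbol distribution
    # over the alphabet {"a", "b"} that favors repeating the last symbol.
    if context and context[-1] == "a":
        return {"a": Fraction(3, 4), "b": Fraction(1, 4)}
    return {"a": Fraction(1, 4), "b": Fraction(3, 4)}

def encode(seq, model):
    # Arithmetic coding: narrow the interval [low, high) by each
    # symbol's probability mass under the model, given its context.
    low, high, context = Fraction(0), Fraction(1), ""
    for sym in seq:
        probs = model(context)
        width = high - low
        cum = Fraction(0)
        for s in sorted(probs):
            if s == sym:
                low, high = low + width * cum, low + width * (cum + probs[s])
                break
            cum += probs[s]
        context += sym
    return (low + high) / 2  # any number inside [low, high) identifies seq

def decode(code, length, model):
    # Mirror the encoder: find which symbol's sub-interval contains code.
    out, context = [], ""
    low, high = Fraction(0), Fraction(1)
    for _ in range(length):
        probs = model(context)
        width = high - low
        cum = Fraction(0)
        for s in sorted(probs):
            lo2 = low + width * cum
            hi2 = lo2 + width * probs[s]
            if lo2 <= code < hi2:
                out.append(s)
                low, high, context = lo2, hi2, context + s
                break
            cum += probs[s]
    return "".join(out)
```

The better the model predicts the data (i.e., the more probability it assigns to each actual next symbol), the narrower each interval shrinks and the fewer bits the final code needs, which is exactly why stronger understanding yields stronger compression.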