Lossless data compression by large models

📅 2024-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address 6G’s demand for ultra-high compression ratios, where conventional lossless compression methods have nearly reached their theoretical limits, this paper introduces LMCompress, the first framework to directly leverage large language models (LLMs) for universal lossless compression. It formulates compression as optimal induction over data distributions, approximating the uncomputable Solomonoff induction. Methodologically, LMCompress integrates LLM-driven sequence modeling and probabilistic prediction, context-aware entropy coding, and model distillation with quantization for acceleration. Experiments demonstrate that LMCompress consistently outperforms state-of-the-art (SOTA) methods across diverse modalities: roughly doubling the compression ratios of JPEG-XL (images), FLAC (audio), and H.264 (video), and roughly quadrupling that of bz2 (text). These gains break classical information-theoretic bottlenecks and establish a novel lossless compression paradigm grounded in universal data understanding.
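The core idea, pairing a predictive model's next-symbol probabilities with entropy coding, can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a toy bigram character model in place of an LLM and computes the ideal arithmetic-coding length (in bits) rather than running a full coder, but it shows why a model that predicts the data better yields a shorter code.

```python
import math
from collections import defaultdict

def ideal_compressed_bits(text, predict):
    """Total code length (in bits) an ideal arithmetic coder would achieve
    when driven by the predictor's next-symbol probabilities:
    sum of -log2 P(symbol | context) over the sequence."""
    bits = 0.0
    for i, ch in enumerate(text):
        p = predict(text[:i], ch)  # P(next symbol | context so far)
        bits += -math.log2(p)
    return bits

def make_uniform_predictor(alphabet):
    """Baseline with no understanding of the data: every symbol equally likely."""
    def predict(context, symbol):
        return 1.0 / len(alphabet)
    return predict

def make_bigram_predictor(corpus, alphabet):
    """Toy stand-in for an LLM: bigram counts with Laplace smoothing.
    (For illustration only; a real system would use a pretrained model
    shared by encoder and decoder.)"""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    def predict(context, symbol):
        prev = context[-1] if context else None
        total = sum(counts[prev].values())
        return (counts[prev][symbol] + 1) / (total + len(alphabet))
    return predict

text = "abababababababab"
alphabet = sorted(set(text))

bits_uniform = ideal_compressed_bits(text, make_uniform_predictor(alphabet))
bits_bigram = ideal_compressed_bits(text, make_bigram_predictor(text, alphabet))
# The predictor that better "understands" the sequence assigns higher
# probability to each observed symbol, so its code length is shorter.
```

Here `bits_uniform` is exactly 16 bits (1 bit per symbol over a 2-letter alphabet), while the bigram predictor compresses well below that; this is the "better understanding means better compression" relationship the paper exploits with LLMs.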

📝 Abstract
Modern data compression methods are slowly reaching their limits after 80 years of research, millions of papers, and a wide range of applications. Yet the extravagant 6G communication speed requirement raises a major open question calling for revolutionary new ideas in data compression. We have previously shown that, under reasonable assumptions, all understanding or learning is compression. Large language models (LLMs) understand data better than ever before. Can they help us compress data? The LLMs may be seen as approximating the uncomputable Solomonoff induction. Therefore, under this new uncomputable paradigm, we present LMCompress. LMCompress shatters all previous lossless compression algorithms, doubling the lossless compression ratios of JPEG-XL for images, FLAC for audio, and H.264 for videos, and quadrupling the compression ratio of bz2 for texts. The better a large model understands the data, the better LMCompress compresses.
Problem

Research questions and friction points this paper is trying to address.

Exploring LLMs for revolutionary lossless data compression
Surpassing limits of traditional compression methods with LMCompress
Linking model understanding to improved compression ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large language models for data compression
LMCompress outperforms traditional compression algorithms
Links model understanding to compression efficiency
Ziguang Li
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Zhongyuan Institute of Artificial Intelligence, Zhengzhou, China
Chao Huang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Zhongyuan Institute of Artificial Intelligence, Zhengzhou, China
Xuliang Wang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Zhongyuan Institute of Artificial Intelligence, Zhengzhou, China
Haibo Hu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Cole Wyeth
School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
Dongbo Bu
Bioinformatics lab, Institute of Computing Technology, Chinese Academy of Sciences
Algorithm design; Bioinformatics (including protein structure prediction, glycan identification using mass spectrometry)
Quan Yu
Peng Cheng Laboratory
Wireless Communication
Wen Gao
Peng Cheng Laboratory, Shenzhen, China
Xingwu Liu
School of Mathematical Sciences, Dalian University of Technology, Dalian, China; School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada; Zhongyuan Institute of Artificial Intelligence, Zhengzhou, China
Ming Li
School of Mathematical Sciences, Dalian University of Technology, Dalian, China; School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada; Zhongyuan Institute of Artificial Intelligence, Zhengzhou, China