Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

📅 2026-05-27

🤖 AI Summary

Lossless text compression preserves every byte, but its compression gains on natural language are often modest. This work studies lossy semantic compression: the encoder strategically deletes portions of the input text, and a large language model reconstructs the original content from the retained “skeleton.” Several deletion strategies (word frequency, GPT-2 surprisal, linear programming optimization, and hybrids of frequency and surprisal) are evaluated on the BBC News dataset. WordFreq is a strong, low-cost baseline, while semantic and hybrid strategies show their clearest gains at mild-to-moderate compression. A local decoder fine-tuned with QLoRA is competitive with Gemini 2.0 Flash, and additional English and Chinese experiments indicate that the framework transfers across domains, although the best deletion rule remains dataset-dependent.

📝 Abstract

Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{\text{keep}} \in [0.1, 0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
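
To make the encoder side concrete, the following is a minimal Python sketch of two of the deletion strategies named in the abstract: uniform step deletion and word-frequency-guided deletion at a retention rate r_keep. It is an illustration under stated assumptions, not the paper's implementation. The function names, the whitespace tokenization, the toy frequency table, the rule that the most frequent words are deleted first, and the choice to emit the skeleton without gap markers are all assumptions made here.

```python
def keep_count(n_words: int, r_keep: float) -> int:
    """Number of words to retain (assumed rule: round half up, keep at least one)."""
    if n_words == 0:
        return 0
    return max(1, int(n_words * r_keep + 0.5))


def step_delete(text: str, r_keep: float) -> str:
    """Uniform step deletion: keep evenly spaced words, ignoring their content."""
    words = text.split()
    k = keep_count(len(words), r_keep)
    if k == 0:
        return ""
    # Pick k positions spread evenly over the text.
    positions = [(j * len(words)) // k for j in range(k)]
    return " ".join(words[i] for i in positions)


def wordfreq_delete(text: str, r_keep: float, freq: dict[str, int]) -> str:
    """Frequency-guided deletion (assumed rule): drop the most frequent words first.

    `freq` is a hypothetical static lookup from lowercase word to corpus count;
    words missing from the table are treated as rare and therefore kept first.
    """
    words = text.split()
    k = keep_count(len(words), r_keep)
    # Rank positions by ascending frequency; ties fall back to text order
    # so the result is deterministic.
    ranked = sorted(range(len(words)),
                    key=lambda i: (freq.get(words[i].lower(), 0), i))
    kept = sorted(ranked[:k])  # restore original word order
    return " ".join(words[i] for i in kept)


# Toy frequency table, invented for illustration only.
TOY_FREQ = {"the": 100, "on": 50, "near": 10, "cat": 5,
            "sat": 4, "mat": 3, "river": 2}

sentence = "the cat sat on the mat near the river"
print(step_delete(sentence, 0.5))                # the cat on mat the
print(wordfreq_delete(sentence, 0.5, TOY_FREQ))  # cat sat mat near river
```

In this toy case the frequency-guided skeleton retains the content words, while step deletion keeps several function words. That is consistent with the abstract's description of WordFreq as a cheap encoder, since it needs only a static lookup and a sort. The decoder, an LLM that reconstructs the full text from the skeleton, is not sketched here.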

Problem

Research questions and friction points this paper is trying to address.

lossy text compression

semantic preservation

strategic deletion

LLM reconstruction

text summarization

Innovation

Methods, ideas, or system contributions that make the work stand out.

lossy semantic text compression

strategic deletion

LLM-based reconstruction