Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

📅 2026-05-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Lossless text compression preserves every byte, but its compression gains on natural language are often modest. This work proposes a lossy semantic compression framework that strategically removes portions of the input text and uses a large language model to reconstruct the original content from the retained “skeleton.” Multiple deletion strategies (uniform step, word length, word frequency, LP optimization, GPT-2 surprisal, and frequency-surprisal hybrids) are evaluated on the BBC News dataset. Results show that WordFreq is a strong, low-cost baseline, while semantic and hybrid strategies give their clearest gains at mild-to-moderate compression. A local decoder fine-tuned with QLoRA is competitive with Gemini 2.0 Flash, and additional English and Chinese experiments suggest that the framework transfers across domains, although the best deletion rule remains dataset-dependent.
📝 Abstract
Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{\mathrm{keep}} \in [0.1, 0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
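To make the encoder side concrete, here is a minimal sketch of word-frequency-guided deletion at a given retention rate. It is an illustration under stated assumptions, not the paper's implementation: the abstract says only that WordFreq uses a static frequency lookup, so the choice to delete the most frequent words first, the whitespace tokenization, the function name wordfreq_skeleton, and the toy frequency table are all hypothetical.

```python
def wordfreq_skeleton(text, r_keep, freq):
    """Return a skeleton that keeps about r_keep of the words.

    Assumption (not confirmed by the source): the most frequent words are
    the most predictable, so they are deleted first and rare words are kept.
    """
    words = text.split()  # naive whitespace tokenization (assumption)
    if not words:
        return ""
    n_keep = max(1, round(r_keep * len(words)))

    def count(i):
        # Static lookup; words missing from the table are treated as rare.
        return freq.get(words[i].lower().strip(".,;:!?\"'"), 0)

    # Rank positions rarest-first; ties are broken by position in the text.
    ranked = sorted(range(len(words)), key=lambda i: (count(i), i))
    kept = set(ranked[:n_keep])
    # Emit the kept words in their original order.
    return " ".join(w for i, w in enumerate(words) if i in kept)


# Toy frequency table (hypothetical counts, for illustration only).
TOY_FREQ = {"the": 100, "on": 50, "near": 10, "old": 8,
            "door": 6, "cat": 5, "sat": 4, "mat": 3}

skeleton = wordfreq_skeleton(
    "the cat sat on the mat near the old door", 0.5, TOY_FREQ
)
print(skeleton)
```

In the framework described above, a skeleton like this would be passed to the LLM decoder, which attempts to restore the deleted words. The surprisal-based and hybrid strategies would replace the static count with a model-derived score while keeping the same keep-the-top-fraction structure.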
Problem

Research questions and friction points this paper is trying to address.

lossy text compression
semantic preservation
strategic deletion
LLM reconstruction
text summarization
Innovation

Methods, ideas, or system contributions that make the work stand out.

lossy semantic text compression
strategic deletion
LLM-based reconstruction
word-frequency-guided deletion
QLoRA fine-tuning