TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

📅 2026-06-06

🤖 AI Summary

This work addresses the challenge of achieving both high storage efficiency and semantic fidelity when aggressively compressing noisy text. The authors propose TextEconomizer, a Seq2Seq (encoder-decoder) framework built around a denoising Transformer that selects the most informative context vectors from the encoder output and applies entropy coding to them. The model reduces variable-sized inputs by 50%–80% without requiring prior knowledge of dataset dimensions. It uses approximately 153× fewer parameters than comparable models and reaches a compression ratio of 5.39×, while maintaining near-perfect scores on BLEU, ROUGE, METEOR, and semantic similarity. Two further models are evaluated: an LSTM-based autoencoder that reaches a 67× compression ratio with 196× fewer parameters, and LLaMAFormer, a modified transformer with 263× fewer parameters than ICAE and competitive text quality.

📝 Abstract

Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.
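
To make the compression-ratio figures concrete, here is a minimal sketch of how such a ratio could be computed once a latent representation has been entropy-coded. It is not the authors' implementation: the abstract does not say which entropy coder is used or exactly how the ratio is defined. The sketch assumes the ratio is the original UTF-8 size in bits divided by the coded latent size in bits, uses Huffman coding as a stand-in entropy coder, and treats `latent_symbols` as a hypothetical list of quantized context-vector values. The names `huffman_code_lengths` and `compression_ratio` are illustrative only.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in the tree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def compression_ratio(text, latent_symbols):
    """Original size in bits divided by the entropy-coded latent size in bits.

    Assumption: codebook and model storage are ignored in this sketch.
    """
    original_bits = 8 * len(text.encode("utf-8"))
    lengths = huffman_code_lengths(latent_symbols)
    coded_bits = sum(lengths[s] for s in latent_symbols)
    return original_bits / coded_bits


if __name__ == "__main__":
    text = "abcdefgh" * 4                  # 32 bytes = 256 bits
    latent = [0, 0, 0, 0, 1, 1, 2, 3]      # hypothetical quantized latent
    print(round(compression_ratio(text, latent), 2))
```

A skewed symbol distribution, as in the toy latent above, is what lets an entropy coder spend fewer bits on frequent values; a real system would also have to account for the cost of storing the codebook and the decoder.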

Problem

Research questions and friction points this paper is trying to address.

lossy text compression

context vectors

entropy coding

semantic quality

noise robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

lossy text compression

denoising transformers

entropy coding