🤖 AI Summary
This work addresses the challenge of achieving both high storage efficiency and semantic fidelity when lossily compressing text, including noisy text. To this end, the authors propose TextEconomizer, an encoder-decoder framework that integrates context vector selection with entropy coding within a Seq2Seq architecture. The method reduces variable-sized inputs by 50% to 80% without requiring prior knowledge of dataset dimensions. Reported results show TextEconomizer using approximately 153× fewer parameters than comparable models while reaching a 5.39× compression ratio and near-perfect scores on text quality metrics such as BLEU, ROUGE, METEOR, and semantic similarity. The authors also evaluate two further models: an LSTM-based autoencoder that reaches a 67× compression ratio with 196× fewer parameters, and LLaMAFormer, a modified transformer with 263× fewer parameters than ICAE and competitive text quality.
📝 Abstract
Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from the encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even when the input text is noisy. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.
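To make the storage side of this pipeline concrete, here is a minimal Python sketch of how a compression ratio could be measured once a context vector has been quantized and entropy coded. Everything in it is an assumption made for illustration: the abstract does not specify the quantizer, the entropy coder, or the vector size, so the 16-level quantizer, the use of `zlib` (DEFLATE, which includes Huffman coding) as a stand-in entropy coder, and the synthetic 64-value vector are all hypothetical. There is no encoder or decoder network here, so the sketch only covers the size accounting, not the reconstruction quality.

```python
import math
import zlib
from collections import Counter


def quantize(vector, levels=16):
    """Map floats in [-1, 1] to integer bins 0..levels-1 (the lossy step)."""
    bins = []
    for x in vector:
        clipped = max(-1.0, min(1.0, x))
        bins.append(min(levels - 1, int((clipped + 1.0) / 2.0 * levels)))
    return bins


def shannon_entropy_bits(symbols):
    """Empirical entropy in bits per symbol of the quantized stream."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by stored size; 5.0 means five times smaller."""
    return len(original) / len(compressed)


# Hypothetical input text and a synthetic stand-in for an encoder context vector.
text = "lossy text compression keeps the meaning and drops the redundancy " * 8
context = [math.tanh(math.sin(i * 0.37)) for i in range(64)]

symbols = quantize(context)
# zlib is used here only as a readily available entropy-coding stand-in.
payload = zlib.compress(bytes(symbols), level=9)

ratio = compression_ratio(text.encode("utf-8"), payload)
print(f"entropy: {shannon_entropy_bits(symbols):.2f} bits/symbol")
print(f"stored bytes: {len(payload)}, compression ratio: {ratio:.2f}x")
```

In this framing, a figure such as 5.39x would mean the stored representation is 5.39 times smaller than the original text. The empirical entropy gives a rough sense of how many bits per symbol a symbol-wise entropy coder would need for the quantized stream.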