🤖 AI Summary
To address perceptual quality degradation, semantic distortion, and the signal-to-noise ratio (SNR) “cliff effect” in image transmission over low-bandwidth, harsh 6G channels, this paper proposes a text-guided discrete token communication paradigm. Methodologically, we use pre-trained vision foundation models to map images into discrete visual tokens, protect those tokens with 5G NR polar codes, and incorporate textual semantic priors to guide token prediction and reconstruction at ultra-low bitrates. Our key contribution is the first integration of vision-language alignment capability into the communication pipeline, without requiring scene-specific retraining, which significantly mitigates performance collapse as SNR degrades. Experiments on ImageNet demonstrate that, at SNRs above 0 dB, our method outperforms ADJSCC in perceptual fidelity (LPIPS reduced by 12.3%) and semantic consistency (CLIP Score increased by 8.7%), while exhibiting strong cross-dataset generalization.
📝 Abstract
With the emergence of 6G networks and the proliferation of visual applications, efficient image transmission under adverse channel conditions is critical. We present a text-guided token communication system that leverages pre-trained foundation models for low-bandwidth wireless image transmission. Our approach converts images to discrete tokens, applies 5G NR polar coding, and employs text-guided token prediction for reconstruction. Evaluations on ImageNet show that our method outperforms Deep Source Channel Coding with Attention Modules (ADJSCC) in perceptual quality and semantic preservation at Signal-to-Noise Ratios (SNRs) above 0 dB, while mitigating the cliff effect at lower SNRs. Our system requires no scenario-specific retraining and exhibits superior cross-dataset generalization, establishing a new paradigm for efficient image transmission aligned with human perceptual priorities.
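To make the token-based pipeline concrete, the following is a minimal toy sketch of only its transport step: token indices are expanded to bits, sent as BPSK over an AWGN channel, and mapped back to tokens. It is not the authors' implementation. The 5G NR polar code and the text-guided token predictor are deliberately omitted, and the 10-bit token width (a 1024-entry codebook), the function names, and the SNR definition (signal power over noise variance per real symbol) are all assumptions made for illustration.

```python
import numpy as np

def tokens_to_bits(tokens, bits_per_token):
    # MSB-first, fixed-width binary expansion of each token index.
    shifts = np.arange(bits_per_token - 1, -1, -1)
    return ((tokens[:, None] >> shifts) & 1).astype(np.uint8).ravel()

def bits_to_tokens(bits, bits_per_token):
    # Inverse of tokens_to_bits.
    shifts = np.arange(bits_per_token - 1, -1, -1)
    grouped = bits.reshape(-1, bits_per_token).astype(np.int64)
    return (grouped << shifts).sum(axis=1)

def bpsk_awgn(bits, snr_db, rng):
    # Uncoded BPSK (0 -> +1, 1 -> -1) with unit signal power.
    # Assumption: SNR = signal power / noise variance per real symbol.
    symbols = 1.0 - 2.0 * bits
    noise_std = np.sqrt(10 ** (-snr_db / 10))
    received = symbols + noise_std * rng.standard_normal(symbols.shape)
    return (received < 0).astype(np.uint8)  # hard decision

def token_error_rate(snr_db, num_tokens=256, bits_per_token=10, seed=0):
    # Fraction of tokens received incorrectly at a given SNR.
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, 1 << bits_per_token, size=num_tokens)
    tx_bits = tokens_to_bits(tokens, bits_per_token)
    rx_tokens = bits_to_tokens(bpsk_awgn(tx_bits, snr_db, rng), bits_per_token)
    return float(np.mean(rx_tokens != tokens))

if __name__ == "__main__":
    for snr_db in (-5, 0, 5, 10):
        print(snr_db, "dB ->", token_error_rate(snr_db))
```

In this uncoded toy, a single flipped bit replaces an entire token with an unrelated codebook entry, so the token error rate rises quickly as SNR falls. That sensitivity is what the channel coding and the text-guided token prediction described in the abstract are meant to counteract.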