AI Summary
To address high semantic redundancy and the trade-off between reconstruction fidelity and bandwidth efficiency in visual semantic communication, this paper proposes a text-visual joint semantic coding and adaptive retransmission framework. Methodologically, it (1) jointly encodes key image latent variables with textual semantic descriptions; (2) employs a diffusion model for conditional image reconstruction; and (3) introduces a semantic-consistency-driven adaptive retransmission mechanism that dynamically identifies and incrementally retransmits semantically mismatched feature blocks. The key innovation lies in embedding semantic alignment evaluation into retransmission decisions and leveraging textual supervision to enhance both the semantic accuracy and the visual fidelity of reconstructed images. Experiments demonstrate that the method reduces transmitted data volume by 42.6% on average while improving semantic accuracy by 18.3% and maintaining PSNR ≥ 28.5 dB, achieving efficient and robust semantic-level visual communication.
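The joint coding step above can be illustrated with a minimal sketch: the latent image representation is assumed to be split into indexed blocks, and only the highest-scoring "key" blocks are packed alongside the text description. The names `SemanticPacket`, `select_key_blocks`, and the importance scores are hypothetical illustrations, not part of the paper's actual interface.

```python
from dataclasses import dataclass

# Hypothetical payload for text-visual joint semantic coding: a concise
# text description plus a small, indexed subset of key latent blocks.
@dataclass
class SemanticPacket:
    text: str          # concise textual semantic description
    block_ids: list    # indices of the transmitted key latent blocks
    blocks: list       # the latent blocks themselves

def select_key_blocks(latents, scores, k):
    """Keep the k blocks with the highest (assumed) importance scores."""
    order = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])  # preserve spatial order of the kept blocks
    return keep, [latents[i] for i in keep]

def make_packet(text, latents, scores, k):
    ids, blocks = select_key_blocks(latents, scores, k)
    return SemanticPacket(text, ids, blocks)
```

Untransmitted block indices are implicit (any index not in `block_ids`), which is what later allows the receiver to request specific missing blocks instead of a full retransmission.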
Abstract
Semantic communication focuses on conveying the intrinsic meaning of data rather than its raw symbolic representation. For visual content, this paradigm shifts from traditional pixel-level transmission toward leveraging the semantic structure of images to communicate visual meaning. Existing approaches generally follow one of two paths: transmitting only text descriptions, which often fail to capture precise spatial layouts and fine-grained appearance details; or transmitting text alongside dense latent visual features, which tends to introduce substantial semantic redundancy. A key challenge, therefore, is to reduce semantic redundancy while preserving semantic understanding and visual fidelity, thereby improving overall transmission efficiency. This paper introduces a diffusion-based semantic communication framework with adaptive retransmission. The system transmits concise text descriptions together with a limited set of key latent visual features, and employs a diffusion-based inpainting model to reconstruct the image. A receiver-side semantic consistency mechanism is designed to evaluate the alignment between the reconstructed image and the original text description. When a semantic discrepancy is detected, the receiver triggers a retransmission to request a small set of additional latent blocks and refine the image reconstruction. This approach significantly reduces bandwidth usage while preserving high semantic accuracy, achieving an efficient balance between reconstruction quality and transmission overhead.
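The receiver-side loop described above (check semantic consistency, then request a few more latent blocks if the reconstruction drifts from the text) can be sketched as follows. This is an assumed control flow, not the paper's implementation: `reconstruct` stands in for the diffusion-based inpainting model returning an image embedding, and the text-image alignment score is reduced to a cosine similarity between toy embedding vectors (a real system would use a learned text-image encoder).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def adaptive_retransmit(text_embed, reconstruct, request_blocks,
                        threshold=0.85, max_rounds=3):
    """Request extra latent blocks until the reconstruction matches the text.

    reconstruct(blocks)      -> image embedding of the current reconstruction
    request_blocks(round_no) -> a small set of additional latent blocks
    """
    blocks = []                       # latent blocks received so far
    image_embed = reconstruct(blocks) # initial reconstruction from text + key blocks
    rounds = 0
    while (cosine_similarity(text_embed, image_embed) < threshold
           and rounds < max_rounds):
        blocks += request_blocks(rounds)   # incremental retransmission
        image_embed = reconstruct(blocks)  # refine the reconstruction
        rounds += 1
    return image_embed, rounds
```

Capping the loop at `max_rounds` mirrors the paper's efficiency goal: retransmission adds only a small number of blocks per round, so bandwidth grows gradually rather than falling back to a full dense-feature transmission.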