🤖 AI Summary
This paper addresses the underuse of cross-modal context in generative semantic communication (GenSC) by proposing Token Communications (TokCom), a framework in which tokens are the communication units. It brings generative foundation models (GFMs) and multimodal large language models (MLLMs) into semantic communication, establishing a cross-modal, token-level paradigm for modeling context. The framework uses a Transformer-based architecture supporting token-wise encoding/decoding, cross-modal alignment, generative semantic compression and reconstruction, and semantic-driven wireless resource adaptation. In an image GenSC setup, TokCom increases bandwidth efficiency by 70.8% with negligible loss of semantic and perceptual quality, which indicates that cross-modal context can be exploited effectively in GenSC.
📝 Abstract
In this paper, we introduce token communications (TokCom), a unified framework to leverage cross-modal context information in generative semantic communications (GenSC). TokCom is a new paradigm, motivated by the recent success of generative foundation models and multimodal large language models (GFM/MLLMs), in which the communication units are tokens, enabling efficient transformer-based token processing at the transmitter and receiver. We discuss the potential opportunities and challenges of leveraging context in GenSC, explore how to integrate GFM/MLLM-based token processing into semantic communication systems to leverage cross-modal context effectively, and present the key principles for efficient TokCom at various layers in future wireless networks. We demonstrate the corresponding TokCom benefits in a GenSC setup for images, where leveraging cross-modal context information increases the bandwidth efficiency by 70.8% with negligible loss of semantic/perceptual quality. Finally, we identify potential research directions to facilitate the adoption of TokCom in future wireless networks.