🤖 AI Summary
To address the surge of spam text on the Internet and the scarcity of labeled training data, this paper proposes GCC-Spam, a unified end-to-end framework for spam text detection. It integrates three complementary components: (1) a Generative Adversarial Network (GAN) that synthesizes realistic pseudo-spam samples to mitigate data scarcity; (2) a character-level similarity network that explicitly models orthographic and phonetic confusion patterns; and (3) contrastive learning that widens the separation between legitimate and spam texts in the latent space, improving robustness. On real-world datasets, GCC-Spam outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples, and its character similarity network is designed to counter character-substitution attacks. The core contribution is the integration of GAN-based data augmentation, character-level confusion modeling, and contrastive learning for spam detection under data-scarce and adversarial conditions.
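To make the character-substitution problem concrete, the following is a minimal sketch, not the paper's method. It replaces the learned character similarity network with a small hand-written lookup table, and the names `CONFUSABLE`, `canonicalize`, and `char_similarity` are hypothetical. It only illustrates why a detector that accounts for look-alike characters can see through simple obfuscation.

```python
from difflib import SequenceMatcher

# Hypothetical confusion table (an assumption for illustration only):
# each look-alike character maps to one canonical form. The paper's
# network models such similarity rather than using a fixed table.
CONFUSABLE = {"0": "o", "1": "l", "3": "e", "@": "a", "$": "s"}


def canonicalize(text: str) -> str:
    """Lowercase the text and replace known confusable characters."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in text.lower())


def char_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two canonicalized strings."""
    return SequenceMatcher(None, canonicalize(a), canonicalize(b)).ratio()
```

With this table, `char_similarity("Fr3e C@$h", "free cash")` treats the obfuscated and plain strings as identical, whereas a comparison on the raw characters would not.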
📝 Abstract
The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework, GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and also produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.
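The contrastive objective can be illustrated with a generic pairwise, margin-based contrastive loss. This is a sketch under stated assumptions: the abstract does not give the exact loss used in GCC-Spam, so the formulation, the function names, and the `margin` parameter below are illustrative choices, not the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin=1.0):
    """Generic pairwise contrastive loss (assumed form, not the paper's).

    Same-class pairs are pulled together by penalizing d^2.
    Different-class pairs (spam vs. normal) are pushed apart until their
    distance is at least `margin`, by penalizing max(0, margin - d)^2.
    """
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Under this form, a spam embedding and a normal embedding that already lie farther apart than `margin` contribute no loss, so training effort goes to pairs that are still too close in the latent space.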