🤖 AI Summary
This work addresses the limitation of existing text anomaly detection methods, which are predominantly confined to the document level and thus unable to precisely localize anomalous segments. To advance fine-grained anomaly detection, we introduce the first token-level text anomaly detection task and construct three benchmark datasets with fine-grained annotations. We propose a unified multi-granularity detection framework that integrates deep learning and natural language processing techniques to jointly support both document-level and token-level anomaly identification. Experimental results demonstrate that our approach significantly outperforms six baseline models on the newly curated datasets. The code and datasets are publicly released to foster further research in fine-grained text anomaly detection.
📝 Abstract
Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis and cannot identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple granularities. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews, and grammatical errors with token-level labels. Experimental results demonstrate that our framework outperforms six baseline methods, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.
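To make the distinction between the two granularities concrete, the sketch below contrasts a single document-level label with per-token labels and shows how token labels enable span localization. This is a hypothetical illustration only; the field names and label encoding are assumptions, not the paper's actual annotation schema or detection method.

```python
# Hypothetical example of the two annotation granularities: a single
# document-level flag vs. per-token flags (1 = anomalous, 0 = normal).

def localize_anomalies(token_labels):
    """Return (start, end) spans of contiguous anomalous tokens."""
    spans, start = [], None
    for i, flag in enumerate(token_labels):
        if flag and start is None:
            start = i                      # span opens here
        elif not flag and start is not None:
            spans.append((start, i))       # span closes before i
            start = None
    if start is not None:
        spans.append((start, len(token_labels)))
    return spans

# A message where only part of the text is spam-like.
tokens = ["Meeting", "at", "3pm", "CLICK", "HERE", "to", "WIN", "$$$"]
token_labels = [0, 0, 0, 1, 1, 0, 1, 1]

doc_label = int(any(token_labels))  # document-level: anomalous or not
print(doc_label)                               # 1
print(localize_anomalies(token_labels))        # [(3, 5), (6, 8)]
```

A document-level detector can only output `1` for the whole message; the token-level labels additionally recover the anomalous spans `CLICK HERE` and `WIN $$$`, which is the fine-grained localization the paper targets.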