TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the insufficient evaluation of embedding-based text anomaly detection (e.g., identifying spam, misinformation, and profanity) by introducing TAD-Bench, the first comprehensive benchmark for this task. It systematically pairs diverse text embeddings (BERT, RoBERTa, Sentence-BERT, and LLM-derived embeddings) with classical and deep anomaly detection algorithms (Isolation Forest, OC-SVM, DeepSVDD, GOAD) and evaluates each combination. TAD-Bench covers cross-domain, multi-granularity real-world datasets and provides a unified framework for assessing how embedding and algorithm choices interact, revealing alignment patterns between embedding characteristics and task granularity. Experiments show that higher-quality embeddings do not necessarily yield better detection performance; the best embedding-algorithm combinations improve F1 scores by up to 12.7% on fine-grained tasks. The project releases an open-source, reproducible framework and benchmark suite, establishing a standardized evaluation foundation for text anomaly detection.
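The embedding-by-detector evaluation described above can be illustrated with a minimal sketch. This is not TAD-Bench's code: the hashed word and character-trigram vectors are toy stand-ins for learned embeddings such as BERT or Sentence-BERT, the k-nearest-neighbor and centroid distance scores are simple stand-ins for detectors such as Isolation Forest or OC-SVM, and all function names (`word_embed`, `trigram_embed`, `knn_score`, `centroid_score`, `evaluate_grid`) are hypothetical. It also assumes a setup where detectors are fit on normal text only and scored by AUROC, which the summary does not specify.

```python
import hashlib
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]
Detector = Callable[[List[Vector], Vector], float]


def _hashed_counts(tokens: Sequence[str], dim: int) -> Vector:
    """L2-normalized hashed count vector (toy stand-in for a learned embedding)."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def word_embed(text: str, dim: int = 64) -> Vector:
    """Hypothetical 'embedding' built from whitespace-separated words."""
    return _hashed_counts(text.lower().split(), dim)


def trigram_embed(text: str, dim: int = 64) -> Vector:
    """Hypothetical 'embedding' built from character trigrams."""
    s = text.lower()
    return _hashed_counts([s[i:i + 3] for i in range(max(len(s) - 2, 1))], dim)


def knn_score(train: List[Vector], x: Vector, k: int = 2) -> float:
    """Anomaly score: mean distance to the k nearest normal training vectors."""
    nearest = sorted(math.dist(t, x) for t in train)[:k]
    return sum(nearest) / len(nearest)


def centroid_score(train: List[Vector], x: Vector) -> float:
    """Anomaly score: distance to the mean of the normal training vectors."""
    centroid = [sum(col) / len(train) for col in zip(*train)]
    return math.dist(centroid, x)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_grid(
    train_texts: Sequence[str],
    test_texts: Sequence[str],
    test_labels: Sequence[int],
    embedders: Dict[str, Embedder],
    detectors: Dict[str, Detector],
) -> Dict[Tuple[str, str], float]:
    """Score every embedding x detector pair on the same data split."""
    results: Dict[Tuple[str, str], float] = {}
    for e_name, embed in embedders.items():
        train_vecs = [embed(t) for t in train_texts]  # normal text only
        test_vecs = [embed(t) for t in test_texts]
        for d_name, score in detectors.items():
            scores = [score(train_vecs, v) for v in test_vecs]
            results[(e_name, d_name)] = auroc(scores, test_labels)
    return results
```

Calling `evaluate_grid` with two embedders and two detectors returns four AUROC values keyed by `(embedding, detector)`. Holding the data split fixed while varying both axes is what makes it possible to compare combinations rather than embeddings or detectors in isolation.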

📝 Abstract
Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Text Anomaly Detection
Embedding Methods
Effectiveness Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

TAD-Bench
Anomaly Detection
Embedding Methods
Authors
Yang Cao
School of Computing and Information Technology, Great Bay University, China; Great Bay Institute for Advanced Study, Great Bay University, China
Sikun Yang
School of Computing and Information Technology, Great Bay University, China; Great Bay Institute for Advanced Study, Great Bay University, China
Chen Li
Graduate School of Informatics, Nagoya University, Japan
Haolong Xiang
Macquarie University
Big Data, Data Mining
Lianyong Qi
College of Computer Science and Technology, China University of Petroleum (East China), China
Bo Liu
College of Cyberspace Security, Zhengzhou University, China
Rongsheng Li
School of Computer, Harbin Engineering University, China
Ming Liu
School of IT, Deakin University, Australia