Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Zero-shot web content classification often suffers from semantic overlap and systematic misclassification caused by ambiguous category definitions. This work identifies definition quality as a critical yet overlooked factor in embedding-based zero-shot systems and introduces a training-free, iterative framework for refining category definitions. Using large language models (LLMs) as feedback-driven optimizers, the approach refines semantic category prototypes, rather than model parameters, using structured signals derived from misclassified samples. The work proposes three LLM-guided refinement strategies (example-guided, confusion-aware, and history-aware) and introduces B2MWT-10C, a human-labeled benchmark of ten URL categories with 1,000 samples per class. Evaluated across 13 state-of-the-art embedding models, the method consistently improves classification performance, indicating that optimizing category definitions alone can raise zero-shot accuracy. The dataset is publicly available.
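One way to picture the "structured signals derived from misclassified samples" is as a tally of which category pairs get confused with each other. The sketch below is an illustrative assumption, not the paper's implementation: the function name `confusion_pairs` and the sample labels are hypothetical, and the paper may build its confusion-aware signal differently.

```python
from collections import Counter


def confusion_pairs(true_labels, predicted_labels):
    """Count (true, predicted) label pairs for misclassified samples only."""
    pairs = Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true != pred:
            pairs[(true, pred)] += 1
    return pairs


# Hypothetical labels, for illustration only.
true_labels = ["news", "shopping", "news", "gambling", "news"]
predicted = ["news", "news", "shopping", "gambling", "shopping"]

pairs = confusion_pairs(true_labels, predicted)
# The most frequent confusion is the natural candidate for a refinement prompt.
most_confused = pairs.most_common(1)[0]
```

In this toy run the most frequent confusion is "news" predicted as "shopping", so those two definitions would be the first to revisit.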
📝 Abstract
Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult because the modern web changes rapidly. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but they remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive, iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies, namely example-guided, confusion-aware, and history-aware, each of which refines class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at https://github.com/naeemrehmat/B2MWT-10C.
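The loop described in the abstract can be sketched in a few lines, under heavy assumptions. Everything here is hypothetical: `embed` is a toy bag-of-words stand-in for a real embedding model, `append_examples` is a placeholder for the LLM optimizer (loosely in the spirit of example-guided refinement), and the categories and samples are invented. The only point is the shape of the procedure: classify by similarity to definitions, collect the errors, rewrite the definitions, and repeat, without touching any model parameters.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, definitions):
    """Assign the category whose definition is most similar to the text."""
    vec = embed(text)
    return max(definitions, key=lambda c: cosine(vec, embed(definitions[c])))


def append_examples(definitions, errors):
    """Hypothetical placeholder for the LLM optimizer: appends each
    misclassified text to the definition of its true category."""
    updated = dict(definitions)
    for text, true_label, _predicted in errors:
        updated[true_label] = updated[true_label] + " " + text
    return updated


def refine(definitions, samples, propose, rounds=3):
    """Iteratively rewrite definitions from misclassified samples."""
    for _ in range(rounds):
        errors = []
        for text, label in samples:
            predicted = classify(text, definitions)
            if predicted != label:
                errors.append((text, label, predicted))
        if not errors:
            break
        definitions = propose(definitions, errors)
    return definitions


# Invented categories and samples, for illustration only.
definitions = {"sports": "games and scores", "finance": "money and markets"}
samples = [
    ("football match scores", "sports"),
    ("stock markets rally", "finance"),
    ("bank loan interest rates", "finance"),
]
refined = refine(definitions, samples, append_examples)
```

With the initial definitions, "bank loan interest rates" shares no words with either category and falls to the first one listed; after one refinement round the "finance" definition covers it. This toy run scores on the same samples it refines on, which a real evaluation would avoid by holding out data.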
Problem

Research questions and friction points this paper is trying to address.

zero-shot classification
definition quality
semantic overlap
web content classification
embedding-based systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot classification
definition refinement
semantic prototype optimization
LLM-based feedback
embedding-based classification