🤖 AI Summary
Hallucinations—factual inaccuracies or irrelevant content—in large language model (LLM) outputs severely undermine their reliability and real-world deployment. To address this, we propose an end-to-end hallucination detection and rewriting framework tailored for production environments. Our method comprises two key components: (1) a multi-granularity detection module integrating named entity recognition (NER), natural language inference (NLI), and span-based detection (SBD), enhanced by a decision tree for fine-grained classification; and (2) a lightweight rewriting mechanism that balances accuracy, latency, and computational cost. Evaluated offline and validated on live production traffic, the system significantly improves response fidelity while meeting stringent operational requirements: end-to-end latency < 200 ms, availability > 99.9%, and high throughput. Our core contribution is the first hallucination mitigation architecture that jointly achieves high detection accuracy, low latency, and engineering deployability in production-grade LLM services.
📝 Abstract
Hallucination, a phenomenon in which large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that demand accuracy and dependability. In this paper, we introduce a reliable, high-speed production system for detecting and correcting hallucinations in LLM responses. Our system combines named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and a decision-tree-based process to reliably detect a wide range of hallucinations in LLM responses. We further design a rewriting mechanism that balances accuracy, response time, and cost-effectiveness. We describe the core elements of our framework and highlight the latency, availability, and performance requirements that are critical for real-world deployment of these technologies. Extensive evaluation on offline data and live production traffic confirms the efficacy of the proposed framework and service.
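To make the multi-granularity detection idea concrete, the sketch below shows how signals from NER, NLI, and SBD modules might be combined by a decision procedure before triggering the rewriting stage. This is a minimal illustration, not the paper's actual system: the three detectors are stubbed with toy heuristics (a real deployment would call trained models), and the decision logic is a simplified stand-in for the paper's decision tree; all function and field names here are hypothetical.

```python
# Hypothetical sketch of a multi-granularity hallucination detector.
# NER/NLI/SBD internals are stubbed with toy heuristics for illustration.
from dataclasses import dataclass, field

@dataclass
class DetectionResult:
    ner_mismatch: bool                 # entity in response absent from source
    nli_contradiction: float           # stand-in contradiction score in [0, 1]
    sbd_spans: list = field(default_factory=list)  # spans flagged as unsupported

def detect(source: str, response: str) -> DetectionResult:
    # NER stub: treat capitalized tokens as entities and flag any that
    # never appear in the source text.
    src_tokens = set(source.split())
    entities = [t for t in response.split() if t[:1].isupper()]
    unsupported = [e for e in entities if e not in src_tokens]

    # NLI stub: a real system would score entailment with a trained model;
    # here the score is simply tied to the entity mismatch.
    nli_score = 0.9 if unsupported else 0.1
    return DetectionResult(bool(unsupported), nli_score, unsupported)

def classify(r: DetectionResult) -> str:
    # Simplified decision logic: entity-level evidence takes priority,
    # then the NLI contradiction score.
    if r.ner_mismatch and r.sbd_spans:
        return "entity_hallucination"
    if r.nli_contradiction > 0.5:
        return "contradiction"
    return "faithful"

def maybe_rewrite(source: str, response: str) -> str:
    # Rewriting stage: only invoked when the detector flags the response,
    # keeping the common (faithful) path cheap and low-latency.
    label = classify(detect(source, response))
    if label == "faithful":
        return response
    return f"[rewritten to match source] {source}"

if __name__ == "__main__":
    src = "Paris is the capital of France."
    print(maybe_rewrite(src, "Paris is the capital of France."))
    print(maybe_rewrite(src, "Lyon is the capital of France."))
```

Routing only flagged responses into the rewriting stage reflects the latency/cost balance the abstract describes: the detector runs on every response, while the more expensive rewrite runs only on the (typically small) hallucinating fraction.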