AuditFraudBench: Benchmarking Audit Judgment in Detecting Fraudulent Misstatements

📅 2026-06-06

🤖 AI Summary

This study addresses the limited ability of current large language models to detect fraudulent misstatements in audited financial reports, and in particular the lack of evaluation benchmarks focused on management narratives and disclosures. To bridge this gap, the authors construct AuditFraudBench, a multitask audit fraud detection benchmark derived from real 10-K and 10-Q filings, Management’s Discussion and Analysis (MD&A) disclosures, financial statements, and SEC Accounting and Auditing Enforcement Releases (AAERs). The benchmark comprises three tasks: Profit Source Attribution, Misleading Narrative Detection, and Fraud Pattern Classification. It pairs original disclosures with restated financials and uses regulatory enforcement actions as ground truth, so that models are evaluated on joint reasoning over financial data, textual narratives, and regulatory evidence. Experiments on GPT, DeepSeek, and Qwen series models show that both proprietary and open models still struggle with this joint reasoning, which underscores both the difficulty of the benchmark and the need for it.

📝 Abstract

Large language models (LLMs) have shown strong performance in financial analysis and surface-level factual error detection, yet their ability to identify fraudulent financial misinformation in audited corporate reporting remains underexplored. Existing financial and audit benchmarks mainly focus on factual verification, numerical reasoning, rule compliance, or audit workflows, but rarely evaluate misleading disclosure narratives or management explanations that obscure the true drivers of reported performance. We introduce AuditFraudBench, an enforcement-grounded benchmark constructed from authentic company filings and regulatory materials, including original and restated 10-K and 10-Q filings, structured financial statements, MD&A disclosures, and SEC Accounting and Auditing Enforcement Releases (AAERs). AuditFraudBench contains three tasks: Profit Source Attribution, Misleading Narrative Detection, and Fraud Pattern Classification, which evaluate whether models can identify the true source of reported performance, detect misleading disclosure framing, and classify misconduct mechanisms into known manipulation patterns. We evaluate GPT, DeepSeek, and Qwen series LLMs on the benchmark. Results show that both proprietary and open models still struggle to jointly reason over financial figures, disclosure framing, restatement evidence, and enforcement-grounded fraud mechanisms. AuditFraudBench provides a challenging testbed for audit-relevant, evidence-grounded evaluation of LLMs in financial reporting.
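To make the three-task structure concrete, here is a minimal sketch of how benchmark items and per-task scoring might be represented. Only the three task names come from the abstract. The field names (`item_id`, `mdna_excerpt`, `gold_label`), the label strings, and the use of exact-match accuracy are illustrative assumptions, not the paper's actual schema or metric.

```python
from collections import defaultdict
from dataclasses import dataclass

# Task names follow the abstract; everything else below is hypothetical.
TASKS = (
    "profit_source_attribution",
    "misleading_narrative_detection",
    "fraud_pattern_classification",
)


@dataclass(frozen=True)
class BenchmarkItem:
    """One hypothetical benchmark example (field names are assumptions)."""
    item_id: str
    task: str          # one of TASKS
    mdna_excerpt: str  # disclosure text shown to the model
    gold_label: str    # label assumed to be derived from enforcement evidence


def per_task_accuracy(items, predictions):
    """Return exact-match accuracy per task.

    `predictions` maps item_id to a predicted label string. A missing
    prediction counts as incorrect. Accuracy is an assumed metric here.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.task not in TASKS:
            raise ValueError(f"unknown task: {item.task}")
        total[item.task] += 1
        pred = predictions.get(item.item_id)
        if pred is not None and pred.strip().lower() == item.gold_label.strip().lower():
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Scoring each task separately keeps a model's strength on one task (for example, pattern classification) from masking weakness on another (for example, narrative detection).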

Problem

Research questions and friction points this paper is trying to address.

fraud detection

audit judgment

financial misinformation

misleading disclosure

LLM evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

AuditFraudBench

fraudulent misstatement detection

misleading narrative