EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A

📅 2026-01-14
🤖 AI Summary
This work addresses the challenge of detecting evasive responses in earnings call transcripts, a task critical to financial transparency but hindered by the absence of large-scale benchmark datasets. The authors introduce EvasionBench, a benchmark comprising 30,000 training samples and 1,000 expert-annotated test instances, together with a multi-model consensus annotation framework. The framework mines disagreements among state-of-the-art large language models (LLMs) to surface high-value borderline cases, which an LLM-as-Judge then adjudicates to assign final labels. By treating inter-model disagreement as an implicit regularization signal, the approach moves beyond conventional single-model distillation. A specialized 4-billion-parameter model, Eva-4B, trained on this dataset achieves 81.3% accuracy on the test set, outperforming baseline methods by 25 percentage points, while matching the performance of leading LLMs at substantially lower inference cost.
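The annotation loop the summary describes (two annotator models label each Q&A pair; disagreements are routed to judges for adjudication) can be sketched as follows. This is an illustrative assumption of the control flow only, not the paper's implementation: the annotators and judges are passed in as callables standing in for LLM calls, and all names are hypothetical.

```python
from collections import Counter

# Hypothetical three-level label set mirroring the paper's evasion taxonomy.
LABELS = ("direct", "intermediate", "fully_evasive")

def mmc_label(qa_pair, annotators, judges):
    """Multi-model consensus sketch.

    annotators, judges: callables mapping a Q&A pair to a label in LABELS
    (stand-ins for frontier-LLM annotation calls).
    """
    votes = [annotate(qa_pair) for annotate in annotators]
    if len(set(votes)) == 1:
        # All annotator models agree: accept the label directly.
        return votes[0]
    # Disagreement mining: this borderline case goes to a judge panel,
    # whose majority vote assigns the final label.
    judge_votes = Counter(judge(qa_pair) for judge in judges)
    return judge_votes.most_common(1)[0][0]
```

In practice the disagreement branch is the interesting one: cases where strong models split are exactly the borderline examples the paper treats as high-value training signal.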

📝 Abstract
We present EvasionBench, a comprehensive benchmark for detecting evasive responses in corporate earnings call question-and-answer sessions. Drawing from 22.7 million Q&A pairs extracted from S&P Capital IQ transcripts, we construct a rigorously filtered dataset and introduce a three-level evasion taxonomy: direct, intermediate, and fully evasive. Our annotation pipeline employs a Multi-Model Consensus (MMC) framework, combining dual frontier LLM annotation with a three-judge majority voting mechanism for ambiguous cases, achieving a Cohen's Kappa of 0.835 on human inter-annotator agreement. We release: (1) a balanced 84K training set, (2) a 1K gold-standard evaluation set with expert human labels, and (3) Eva-4B, a 4-billion parameter classifier fine-tuned from Qwen3-4B that achieves 84.9% Macro-F1, outperforming Claude 4.5, GPT-5.2, and Gemini 3 Flash. Our ablation studies demonstrate the effectiveness of multi-model consensus labeling over single-model annotation. EvasionBench fills a critical gap in financial NLP by providing the first large-scale benchmark specifically targeting managerial communication evasion.
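The abstract reports a Cohen's Kappa of 0.835 for human inter-annotator agreement. The statistic itself is standard; a minimal generic implementation (not the paper's evaluation code) for two annotators' label sequences:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each annotator's label marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.835 falls in the range conventionally read as "almost perfect" agreement, which supports the reliability of the gold-standard test labels.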
Problem

Research questions and friction points this paper is trying to address.

evasive answers
financial Q&A
earnings calls
financial transparency
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-model consensus
LLM-as-judge
disagreement mining
evasive answer detection
implicit regularization