AI Summary
This study addresses the challenge of fact-checking misleading claims constructed through multi-turn interactions in conversational audio, a scenario poorly handled by existing methods. The work presents the first systematic investigation of claim verification in this setting, introducing a calibrated multimodal verification approach that integrates a context-aware audio encoder with a dialogue-aware textual model. To support research in this domain, the authors construct MAD2, a new benchmark comprising 1,000 dialogues and 3,368 verifiable claims. Experimental results demonstrate that dialogue structure exerts a stronger influence on verification performance than the deceptive phrasing of claims themselves. Notably, leveraging only preceding contextual information enables near-offline accuracy in real-time verification, underscoring the critical role of dialogue context in enhancing multimodal fact-checking effectiveness.
Abstract
Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.