AI Summary
This work addresses subtle factual, formulaic, and numerical errors in financial and tabular question answering, which often yield plausible yet incorrect answers. The authors propose a verification mechanism built on a market of atomic claims: each question is decomposed into typed atomic claims, which specialist trader agents buy or sell to express confidence. The orders are cleared into confidence-weighted accept/reject decisions, and an executable Python program is synthesized from the market-supported evidence; a code-aware verifier then checks the program, with at most one market-aware repair round. By replacing free-form multi-agent debate with claim-level verification, the method aims to improve robustness in high-stakes numerical reasoning. Using a fixed Qwen3.6-27B backbone, it reports strong performance across ten public benchmarks, including FinQA (78.3%), FinanceMath (76.0%), MultiHiertt (71.2%), ESGenius (86.9%), and FinChart-Bench (85.6% average).
Abstract
Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.}
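
To make the market-clearing step concrete, the following is a minimal sketch of how buy and sell orders on atomic claims could be cleared into confidence-weighted accept/reject decisions. It is an illustration only, not the authors' implementation: the `Order` type, the `clear_market` function, the normalized net-confidence score, the zero threshold, and the example claim identifiers are all assumptions introduced here, and the abstract does not specify the actual clearing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """One trader's position on one atomic claim (hypothetical structure)."""
    claim_id: str
    side: str          # "buy" supports the claim, "sell" opposes it
    confidence: float  # trader confidence in [0, 1]


def clear_market(orders, threshold=0.0):
    """Clear orders into per-claim (accepted, score) decisions.

    Assumed rule: score = (buy confidence - sell confidence) / total
    confidence, so the score lies in [-1, 1]. A claim is accepted when
    its score is strictly above `threshold`.
    """
    totals = {}  # claim_id -> (net signed confidence, total confidence)
    for order in orders:
        if order.side not in ("buy", "sell"):
            raise ValueError(f"unknown side: {order.side!r}")
        if not 0.0 <= order.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        signed = order.confidence if order.side == "buy" else -order.confidence
        net, weight = totals.get(order.claim_id, (0.0, 0.0))
        totals[order.claim_id] = (net + signed, weight + order.confidence)

    decisions = {}
    for claim_id, (net, weight) in totals.items():
        # A claim with zero total confidence carries no evidence either way.
        score = net / weight if weight > 0 else 0.0
        decisions[claim_id] = (score > threshold, score)
    return decisions


# Hypothetical orders from three traders on two typed claims.
example_orders = [
    Order("fact:revenue_2020", "buy", 0.75),
    Order("fact:revenue_2020", "buy", 0.5),
    Order("fact:revenue_2020", "sell", 0.25),
    Order("formula:pct_change", "buy", 0.25),
    Order("formula:pct_change", "sell", 0.75),
]
decisions = clear_market(example_orders)
accepted_claims = [cid for cid, (ok, _) in decisions.items() if ok]
```

In this sketch, only the claims in `accepted_claims` would be passed on as market-supported evidence for program synthesis. Aggregating per claim rather than per answer means a single disputed formula can be rejected without discarding the facts that traders agree on.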