Synthetic Data-Driven Prompt Tuning for Financial QA over Tables and Documents

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Financial documents, such as financial statements and balance sheets, contain lengthy tables and multi-page textual content. Large language models (LLMs) depend heavily on high-quality prompts for numerical reasoning, yet existing prompting methods rely on manually annotated data, generalize poorly, and struggle to adapt to novel document structures. Method: We propose a synthetic-data-driven, self-optimizing prompting framework built around a closed-loop self-improvement mechanism: a synthetic data generator actively identifies prompt deficiencies, while multi-level validation ensures synthetic data quality, enabling iterative prompt refinement without external annotations. The method dynamically integrates synthetically generated financial tables and textual fragments to improve prompt robustness and accuracy. Contribution/Results: Evaluated on the DocMath-Eval benchmark, our approach significantly outperforms conventional prompting methods, achieving higher accuracy and superior noise resilience. The results demonstrate the effectiveness and scalability of synthetic data for prompt learning in financial document understanding.

📝 Abstract
Financial documents like earnings reports or balance sheets often involve long tables and multi-page reports. Large language models have become a new tool for numerical reasoning over and understanding of these documents. However, prompt quality can have a major effect on how well LLMs perform these financial reasoning tasks. Most current methods tune prompts on fixed datasets of financial text or tabular data, which limits their ability to adapt to new question types or document structures, or they rely on costly, manually labeled and curated datasets to help build the prompts. We introduce a self-improving prompt framework driven by data-augmented optimization. In this closed-loop process, we generate synthetic financial tables and document excerpts, verify their correctness and robustness, and then update the prompt based on the results. Specifically, our framework combines a synthetic data generator with verifiers and a prompt optimizer: the generator produces new examples that expose weaknesses in the current prompt, the verifiers check the validity and robustness of the produced examples, and the optimizer incrementally refines the prompt in response. By iterating these steps in a feedback cycle, our method steadily improves prompt accuracy on financial reasoning tasks without needing external labels. Evaluation on the DocMath-Eval benchmark demonstrates that our system achieves higher performance in both accuracy and robustness than standard prompting methods, underscoring the value of incorporating synthetic data generation into prompt learning for financial applications.
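The closed loop described in the abstract (generate synthetic example → verify → evaluate the prompt → refine) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: every function here is a hypothetical stand-in (in the actual system the generator, verifiers, and optimizer would each query an LLM over real financial tables and text).

```python
import random

def generate_synthetic_example(prompt, rng):
    # Stand-in generator: the paper's generator is an LLM producing synthetic
    # financial tables/excerpts that target the current prompt's weaknesses.
    revenue, cost = rng.randint(100, 999), rng.randint(10, 99)
    question = f"Revenue is {revenue} and cost is {cost}. What is profit?"
    return {"question": question, "answer": revenue - cost}

def verify(example):
    # Stand-in for multi-level validation: check the synthetic example is
    # well-formed and its gold answer is internally consistent.
    return isinstance(example["answer"], int)

def evaluate(prompt, example):
    # Stand-in evaluator: a real system would run the LLM on
    # prompt + question and compare its output to the gold answer.
    return len(prompt) > 20  # toy proxy: richer prompts "succeed"

def refine(prompt, failed_example):
    # Stand-in optimizer: revise the prompt in response to the failure.
    return prompt + " Show intermediate arithmetic before the final answer."

def self_improve(prompt, iterations=5, seed=0):
    """Iterate the generate -> verify -> evaluate -> refine feedback cycle."""
    rng = random.Random(seed)
    for _ in range(iterations):
        ex = generate_synthetic_example(prompt, rng)
        if not verify(ex):
            continue  # discard low-quality synthetic data
        if not evaluate(prompt, ex):
            prompt = refine(prompt, ex)  # update only on exposed failures
    return prompt
```

Note the design point carried over from the abstract: the prompt is updated only when a *verified* synthetic example exposes a failure, so no external labels enter the loop.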
Problem

Research questions and friction points this paper is trying to address.

Improving financial reasoning accuracy by optimizing prompts through synthetic data generation
Overcoming limitations of fixed datasets in adapting to new financial question types
Enhancing prompt robustness without requiring costly manual data labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-improving prompt framework with data-augmented optimization
Generates synthetic financial tables and document excerpts
Combines synthetic data generator with verifiers and optimizer
Yaoning Yu
University of Illinois Urbana-Champaign, Champaign, IL, United States

Kai-Min Chang
U.S. Bank, United States

Ye Yu
University of Illinois Urbana-Champaign, Champaign, IL, United States

Kai Wei
Amazon
Computational social science · NLP · SLU

Haojing Luo
Starc.institute, United States

Haohan Wang
School of Information Sciences, University of Illinois Urbana-Champaign
Computational Biology · Agentic AI · AI4Science · AI security