LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context large language models (LLMs) that rely on synthetic data augmentation often suffer from diminished reasoning faithfulness due to information distortion, missing citations, and factual inconsistencies. To address this, we propose LongFaith, a truth-aligned, citation-driven paradigm for faithful data synthesis that integrates ground-truth constraints with citation-aware prompting. Our approach systematically models three core dimensions of faithfulness: verifiability, attributability, and context anchoring. It combines truth-guided instruction synthesis, multi-stage filtering, and structured annotation. We release two high-quality, open-source datasets, LongFaith-SFT and LongFaith-PO, and fine-tune LLMs on them. Empirical evaluation shows substantial improvements on LongBench and multi-hop reasoning benchmarks. Ablation studies confirm the effectiveness and scalability of the method, demonstrating that grounding synthesis in verifiable facts and explicit citations significantly enhances faithful long-context reasoning.
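Based on the summary's description of citation-aware prompting over a long context with a known ground-truth answer, here is a minimal illustrative sketch of how such a synthesis prompt could be assembled. The template wording, function name, and citation format are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative sketch of a citation-grounded synthesis prompt in the spirit of
# LongFaith. Template wording and names are hypothetical, not from the paper.

CITATION_TEMPLATE = """Answer the question using ONLY the numbered passages below.
Cite the supporting passage for every reasoning step as [k].
The ground-truth answer is given; your reasoning chain must end at it.

Passages:
{passages}

Question: {question}
Ground-truth answer: {answer}

Reasoning (with citations):"""

def build_synthesis_prompt(passages: list[str], question: str, answer: str) -> str:
    """Number the context passages and fill the citation-aware template."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return CITATION_TEMPLATE.format(passages=numbered, question=question, answer=answer)

if __name__ == "__main__":
    print(build_synthesis_prompt(
        passages=["Marie Curie was born in Warsaw.", "Warsaw is in Poland."],
        question="In which country was Marie Curie born?",
        answer="Poland",
    ))
```

Anchoring the prompt to both the numbered passages and the known answer is what lets the pipeline skip a separate verification pass: any sampled chain that cites out-of-range passages or drifts from the answer can be discarded mechanically.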

📝 Abstract
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
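The abstract's claim that citation-grounded chains "mitigate the need for costly verification" suggests a cheap mechanical filter over sampled chains, with faithful and unfaithful chains paired for preference optimization. The sketch below is a hedged illustration of that idea; the heuristics, function names, and field layout are assumptions, not the paper's exact pipeline.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def is_faithful(chain: str, answer: str, n_passages: int) -> bool:
    """Heuristic check (illustrative): at least one citation, every cited index
    points at a real passage, and the chain ends at the ground-truth answer."""
    cited = [int(m) for m in CITATION.findall(chain)]
    in_range = bool(cited) and all(1 <= c <= n_passages for c in cited)
    return in_range and answer.lower() in chain.lower()

def preference_pairs(chains: list[str], answer: str, n_passages: int) -> list[dict]:
    """Pair faithful chains (chosen) against unfaithful ones (rejected),
    yielding LongFaith-PO-style preference records."""
    chosen = [c for c in chains if is_faithful(c, answer, n_passages)]
    rejected = [c for c in chains if not is_faithful(c, answer, n_passages)]
    return [{"chosen": c, "rejected": r} for c in chosen for r in rejected]

if __name__ == "__main__":
    chains = [
        "Curie was born in Warsaw [1], which is in Poland [2]. Answer: Poland",
        "She was probably born in France. Answer: France",
    ]
    print(preference_pairs(chains, answer="Poland", n_passages=2))
```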
Problem

Research questions and friction points this paper is trying to address.

Improving long-context reasoning in LLMs
Mitigating misinformation in synthetic data
Enhancing faithfulness in reasoning datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes faithful long-context reasoning datasets
Integrates ground truth and citation-based prompts
Open-sources LongFaith-SFT and LongFaith-PO datasets (a fine-tuning sketch follows below)
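The released datasets target supervised fine-tuning (LongFaith-SFT) and preference optimization (LongFaith-PO). As a hedged illustration, the sketch below trains on LongFaith-PO-style pairs with Hugging Face trl's DPOTrainer; the file path, base model, column schema, and the choice of DPO itself are assumptions, since this page does not specify the paper's training recipe or release format.

```python
# Hedged sketch: preference optimization on LongFaith-PO-style pairs using
# DPO as one representative method. Paths and model name are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed JSONL schema: {"prompt": ..., "chosen": ..., "rejected": ...}
dataset = load_dataset("json", data_files="longfaith_po.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="longfaith-dpo", per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # trl >= 0.12; older versions use `tokenizer=`
)
trainer.train()
```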