🤖 AI Summary
Scientific publishing agents built on LLMs often introduce field-level errors into BibTeX citations because they over-rely on parametric memory. To address this, this work constructs a benchmark of 931 papers spanning four disciplines and three citation tiers, and for the first time disentangles model memory from retrieval dependence under realistic search conditions. Guided by co-occurrence analysis of field-level errors, the authors propose a two-stage citation correction framework that pairs large language models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) with deterministic retrieval tools (the Zotero Translation Server with CrossRef fallback). The method improves field-level accuracy from 83.6% to 91.5% and full-entry correctness from 50.9% to 78.3%, with a regression rate of only 0.8%.
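The field-level metrics above (83.6% field accuracy, 50.9% fully correct entries) can be sketched as a per-field comparison over BibTeX entries. A minimal sketch, assuming a particular nine-field list and simple case/whitespace normalization, neither of which is confirmed by the paper:

```python
# Hedged sketch of field-level BibTeX scoring: compare a generated entry
# against ground truth, field by field. The nine-field list and the
# normalization rule are illustrative assumptions, not the paper's protocol.

FIELDS = ["title", "author", "year", "journal", "volume",
          "number", "pages", "doi", "publisher"]  # assumed nine fields

def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as field errors."""
    return " ".join(str(value).split()).lower()

def score_entry(generated, ground_truth, fields=FIELDS):
    """Return per-field correctness and whether the full entry is correct."""
    per_field = {
        f: normalize(generated.get(f, "")) == normalize(ground_truth.get(f, ""))
        for f in fields
    }
    return per_field, all(per_field.values())

def corpus_metrics(pairs):
    """Aggregate field-level accuracy and full-entry correctness
    over (generated, ground_truth) pairs."""
    field_obs, fully_correct = [], 0
    for gen, gt in pairs:
        per_field, entry_ok = score_entry(gen, gt)
        field_obs.extend(per_field.values())
        fully_correct += entry_ok
    return sum(field_obs) / len(field_obs), fully_correct / len(pairs)
```

Under this scheme, one wrong field in an otherwise perfect entry still counts against full-entry correctness, which is why field accuracy (83.6%) can sit far above the fully-correct rate (50.9%).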
📝 Abstract
Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field errors. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises 8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and the regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
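The two-stage integration described above can be sketched as: stage one performs a deterministic lookup (in the paper, via the Zotero Translation Server with CrossRef fallback), and stage two revises the model's baseline entry against the retrieved record, leaving the baseline untouched when retrieval fails. This is a minimal sketch under assumptions: the function names and the field-overwrite policy (authoritative values win when non-empty) are illustrative, not the paper's exact design.

```python
# Hedged sketch of the two-stage integration. Stage 1 (retrieval) is
# abstracted as a callable so a deterministic backend (e.g. Zotero
# Translation Server, then CrossRef) can be plugged in; the revision
# policy below is an assumed illustration.

def revise_entry(baseline, authoritative):
    """Stage 2: revise the model-generated baseline against an
    authoritative record, preferring authoritative values per field."""
    revised = dict(baseline)
    for field, value in authoritative.items():
        if value:  # only overwrite with non-empty authoritative values
            revised[field] = value
    return revised

def two_stage_correct(baseline, retrieve):
    """Run both stages; fall back to the unmodified baseline when the
    deterministic lookup finds nothing, which bounds the regression risk."""
    authoritative = retrieve(baseline)  # stage 1: deterministic lookup
    if not authoritative:
        return baseline                 # fallback: keep model output as-is
    return revise_entry(baseline, authoritative)
```

Separating the stages this way means the LLM never has to decide whether to trust its own memory against a search result mid-generation, which is one plausible reading of why the two-stage design regresses less often (0.8% vs. 4.8%) than single-stage integration.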