🤖 AI Summary
This work addresses the automated assessment of scientific impact by proposing the first framework that adapts causal pre-trained language models to long-horizon citation rate prediction. Methodologically, it attaches a lightweight linear regression head to a causal Transformer and fine-tunes the model to regress the average monthly citation rate, a simple yet robust paradigm for temporal impact modeling. Evaluation uses strict temporal hold-out validation, and gradient-based saliency analysis is used to inspect which inputs drive the predictions. On a dataset of over 900,000 biomedical papers, the model achieves a Spearman correlation of ρ = 0.826, a 27-point improvement over the prior state of the art. Scaling-law analysis shows consistent performance gains with increasing model size and data volume.
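The evaluation protocol described above (a strict temporal hold-out scored with Spearman's ρ) can be illustrated with a minimal, dependency-free sketch. The record layout (`published` field), the cutoff date, and all function names here are assumptions made for illustration; they are not taken from the ForeCite code.

```python
from datetime import date


def temporal_holdout(papers, cutoff):
    """Split papers so that every test paper was published after the cutoff.

    `papers` is a list of dicts with a hypothetical `published` date field.
    """
    train = [p for p in papers if p["published"] <= cutoff]
    test = [p for p in papers if p["published"] > cutoff]
    return train, test


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy usage: hold out everything published after an (assumed) cutoff date.
papers = [
    {"id": "a", "published": date(2019, 5, 1)},
    {"id": "b", "published": date(2021, 3, 1)},
    {"id": "c", "published": date(2023, 8, 1)},
]
train, test = temporal_holdout(papers, cutoff=date(2021, 12, 31))
```

Because Spearman's ρ depends only on rank order, it rewards a model for ordering papers correctly by citation rate even when the predicted magnitudes are miscalibrated.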
📝 Abstract
Predicting the future citation rates of academic papers is an important step toward the automation of research evaluation and the acceleration of scientific progress. We present $\textbf{ForeCite}$, a simple but powerful framework to append pre-trained causal language models with a linear head for average monthly citation rate prediction. Adapting transformers for regression tasks, ForeCite achieves a test correlation of $\rho = 0.826$ on a curated dataset of 900K+ biomedical papers published between 2000 and 2024, a 27-point improvement over the previous state-of-the-art. Comprehensive scaling-law analysis reveals consistent gains across model sizes and data volumes, while temporal holdout experiments confirm practical robustness. Gradient-based saliency heatmaps suggest a potentially undue reliance on titles and abstract texts. These results establish a new state-of-the-art in forecasting the long-term influence of academic research and lay the groundwork for the automated, high-fidelity evaluation of scientific contributions.
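To make the "linear head on a causal language model" idea concrete, the following is a rough sketch of the regression forward pass in plain Python. It assumes several details the abstract does not specify: that the paper is represented by the hidden state of its last non-padding token, that inputs are right-padded, and that the target is total citations divided by months since publication. All names are hypothetical.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the final non-padding token.

    Assumes right padding, so real tokens occupy the leading positions.
    In a causal model this token has attended to the whole input.
    """
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


def linear_head(features, weights, bias):
    """Map a pooled feature vector to a single scalar prediction."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def predict_rate(hidden_states, attention_mask, weights, bias):
    """Pool the sequence, then regress one citation-rate value from it."""
    pooled = last_token_pool(hidden_states, attention_mask)
    return linear_head(pooled, weights, bias)


def monthly_citation_rate(total_citations, months_since_publication):
    """Assumed target definition: citations averaged over elapsed months."""
    return total_citations / months_since_publication


# Toy usage: three token positions (the last is padding), hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
prediction = predict_rate(hidden, mask, weights=[0.5, -1.0], bias=0.25)
target = monthly_citation_rate(total_citations=120, months_since_publication=24)
```

In an actual fine-tuning setup the hidden states would come from the pre-trained model, and the head would be trained jointly with it under a regression loss; the sketch only shows how a token sequence collapses to one scalar.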