AI Summary
To address the challenge of simultaneously achieving fine-grained scoring and global cross-document reasoning in document re-ranking, this paper proposes DeAR, a two-stage re-ranking framework based on large language model (LLM) distillation. In the first stage, multi-objective knowledge distillation, combining cross-entropy loss, RankNet loss, and KL divergence, transfers the LLaMA teacher model's token-level relevance judgment capability to a lightweight student model. In the second stage, LoRA adapters are introduced to enable chain-of-thought reasoning at the list level, generating interpretable natural-language rationales. Evaluated on standard benchmarks including TREC-DL, BEIR, and NovelEval, DeAR consistently outperforms state-of-the-art open-source methods: it achieves a 5.1-point improvement in nDCG@5 on DL20, attains 90.97 nDCG@10 on NovelEval, and reaches 54.29% Top-1 accuracy on Natural Questions, demonstrating substantial gains in both effectiveness and interpretability.
Abstract
Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose Deep Agent Rank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact 3B or 8B student model using a hybrid of cross-entropy, RankNet, and KL-divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions and surpassing baselines such as MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems. Dataset and code are available at https://github.com/DataScienceUIBK/DeAR-Reranking.
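The Stage-1 objective described above combines three losses over the student's relevance scores. The following is a minimal NumPy sketch of such a hybrid loss; the function names, the equal default weighting, and the use of raw score lists are illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative sketch of a hybrid distillation loss (cross-entropy +
# RankNet + KL divergence), as in DeAR's Stage 1. All names and the
# weighting scheme are assumptions for exposition, not the paper's code.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(student_scores, labels):
    # Pointwise binary cross-entropy against gold relevance labels.
    p = 1.0 / (1.0 + np.exp(-np.asarray(student_scores, dtype=float)))
    eps = 1e-12
    labels = np.asarray(labels, dtype=float)
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))

def ranknet(student_scores, teacher_scores):
    # Pairwise RankNet loss: penalize student pairs ordered
    # differently from the teacher's ordering.
    loss, n = 0.0, 0
    for i in range(len(student_scores)):
        for j in range(len(student_scores)):
            if teacher_scores[i] > teacher_scores[j]:
                diff = student_scores[i] - student_scores[j]
                loss += np.log1p(np.exp(-diff))
                n += 1
    return loss / max(n, 1)

def kl_div(student_scores, teacher_scores):
    # Listwise KL divergence between teacher and student
    # softmax distributions over the candidate list.
    p = softmax(np.asarray(teacher_scores, dtype=float))
    q = softmax(np.asarray(student_scores, dtype=float))
    return float(np.sum(p * np.log(p / q)))

def hybrid_loss(student_scores, teacher_scores, labels, w=(1.0, 1.0, 1.0)):
    # Weighted sum of the three objectives (weights are assumed).
    return (w[0] * cross_entropy(student_scores, labels)
            + w[1] * ranknet(student_scores, teacher_scores)
            + w[2] * kl_div(student_scores, teacher_scores))
```

A student whose scores match the teacher's ordering and the gold labels yields a lower hybrid loss than one that inverts the ranking, which is the calibration behavior the ablations attribute to the dual-loss design.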