Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses skill retrieval for large language model (LLM) agents, where semantic relevance between a query and each individual skill is not enough: the retrieved skills must also work together to complete the task. The authors propose R3 (Reject-as-Resource Retriever), a two-stage retrieval framework comprising R3-Embedding and R3-Reranker, which explicitly models skill compatibility by using the LLM's rejection decisions as supervision for which skills should not be retrieved together under a given query. They also introduce R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark grounded in realistic user requests. On R3-Skill, the pipeline achieves Hit@1 of 0.7714, NDCG@10 of 0.8327, and Set-Compat of 0.3525. The code, data, and models are publicly released.
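To illustrate the general retrieve-then-rerank shape that a two-stage system like this follows, here is a minimal sketch. It is not the authors' implementation: the function names, the toy vectors, and the `rerank_score` callable (standing in for a cross-encoder that scores a skill against the query) are all hypothetical, and a real system would use learned embedding and reranking models.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(
    query_vec: Vector,
    skill_vecs: Dict[str, Vector],
    rerank_score: Callable[[str], float],  # hypothetical stand-in for a cross-encoder
    n_candidates: int = 3,
    top_k: int = 2,
) -> List[str]:
    """Stage 1: shortlist skills by embedding similarity. Stage 2: reorder the shortlist."""
    # Stage 1 (bi-encoder style): cheap similarity against every skill.
    candidates = sorted(
        skill_vecs,
        key=lambda name: cosine(query_vec, skill_vecs[name]),
        reverse=True,
    )[:n_candidates]
    # Stage 2 (cross-encoder style): a more expensive score on the shortlist only.
    return sorted(candidates, key=rerank_score, reverse=True)[:top_k]
```

The point of the split is that the second-stage scorer only sees a handful of candidates, so it can afford a richer signal, such as the compatibility supervision described here.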

📝 Abstract

LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top-K joint correctness depends not only on the semantic relevance of each individual query-skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Yet existing LLM-based data synthesis pipelines can produce a direct supervision signal for "which skills should not be jointly retrieved under this query" -- namely the LLM's own rejection decisions -- and this signal is routinely discarded as low-quality data. To address this gap, we propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark targeting realistic agent skill routing. R3-Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi-expert cross-checking. On R3-Skill, we build a two-stage retrieval system (R3-Embedding + R3-Reranker) with skill compatibility as an explicit training signal. Gradient analysis shows that the "push-away" signal is diluted by bilateral balancing in the bi-encoder but acts as lossless graded ranking supervision in the cross-encoder -- motivating its placement at the cross-encoder stage, as confirmed by ablations on two datasets. The R3-Embedding + R3-Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill. The dataset, training code and model weights are released as open source for agent skill routing.
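For readers unfamiliar with the reported ranking metrics, the sketch below computes Hit@k and NDCG@k under their common textbook definitions with binary relevance. This is an assumption: the abstract does not spell out the exact formulas used, and Set-Compat is omitted because its definition is not given here. The function names are illustrative, and each ranked list is assumed to contain no duplicate skills.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 1) -> float:
    """Return 1.0 if any of the top-k ranked skills is relevant, else 0.0."""
    return 1.0 if any(skill in relevant for skill in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: discounted gain of the ranking divided by the ideal gain."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    # The ideal ranking places all relevant skills first, up to k of them.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A benchmark-level score such as the Hit@1 quoted above would then typically be the mean of the per-query values over all test queries.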

Problem

Research questions and friction points this paper is trying to address.

skill retrieval

LLM agent

skill compatibility

query-conditional retrieval

rejection signal

Innovation

Methods, ideas, or system contributions that make the work stand out.

skill retrieval

skill compatibility

rejection-based supervision