Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

Qiang Ji; Ruizhi Qiao; Wei Wen; Xing Sun; Zifei Wang

arxiv: 2606.03565 · v2 · pith:V6YRODC2new · submitted 2026-06-02 · 💻 cs.IR


Pith reviewed 2026-06-28 08:11 UTC · model grok-4.3

classification 💻 cs.IR

keywords: skill retrieval, LLM agents, rejection decisions, two-stage retrieval, skill compatibility, benchmark

The pith

LLM rejection decisions supply direct supervision for skill compatibility in agent retrieval, improving joint correctness over relevance-only methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims skill retrieval for LLM agents differs from document retrieval because top-K success depends on whether retrieved skills can collaborate to complete the query, not merely on individual relevance scores. Existing synthesis pipelines discard the LLM's own rejection decisions, yet these decisions encode exactly the missing compatibility signal. The authors release the R3-Skill benchmark and train a two-stage system in which an embedding model handles initial relevance while a cross-encoder reranker is trained explicitly on the rejection-derived push-away signal. This yields Hit@1 of 0.7714 and Set-Compat of 0.3525, with ablations confirming the cross-encoder placement is required for the signal to remain effective.

Core claim

The R3-Embedding plus R3-Reranker pipeline treats LLM rejection decisions as an explicit training signal for query-conditional skill compatibility; gradient analysis shows the push-away signal is diluted under bilateral balancing in the bi-encoder but supplies lossless graded supervision in the cross-encoder, which is why compatibility training is placed only at the reranker stage.
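
The dilution argument can be made concrete with the standard InfoNCE gradient. What follows is a sketch, assuming logits are scaled inner products $z_c = e(q)^\top e(c)/\tau$ (which matches the $\partial z_{s'}/\partial e(s') = e(q)/\tau$ quoted in the Figure 4 caption). It is not the paper's Theorem 1, whose statement is truncated in this rendering; $\tilde{p}_{s'}$ is notation introduced here for the softmax probability of $s'$ under $\tilde{q}$.

```latex
\begin{aligned}
\mathcal{L}^{(q)}_{\mathrm{NCE}} &= -\log p_{s^+},
\qquad p_c = \frac{\exp(z_c)}{\sum_{c' \in C} \exp(z_{c'})},
\qquad z_c = \frac{e(q)^\top e(c)}{\tau}, \\
\frac{\partial \mathcal{L}^{(q)}_{\mathrm{NCE}}}{\partial z_{s'}} &= p_{s'}
\qquad (s' \neq s^+), \\
\nabla_{e(s')} \mathcal{L}^{(q)}_{\mathrm{NCE}} &= \frac{p_{s'}}{\tau}\, e(q)
\qquad \text{(push: a descent step moves } e(s') \text{ away from } e(q)\text{)}, \\
\nabla_{e(s')} \mathcal{L}^{(\tilde{q})}_{\mathrm{NCE}} &= -\frac{1 - \tilde{p}_{s'}}{\tau}\, e(\tilde{q})
\qquad \text{(pull: } s' \text{ is the positive under } \tilde{q}\text{)}.
\end{aligned}
```

Both terms update the same vector $e(s')$, so the net update is their sum, and the SKIP push under $q$ is partly cancelled whenever $e(q)$ and $e(\tilde{q})$ point in similar directions. A cross-encoder scores each $(q, s')$ pair jointly and keeps no shared per-skill vector, which appears to be the reason the paper places the compatibility objective there.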

What carries the argument

Two-stage R3 retriever (embedding model followed by cross-encoder reranker) trained with rejection decisions as the compatibility supervision signal.
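
The control flow of such a two-stage retriever can be sketched in a few lines. This is a minimal illustration, not the released R3 code: the skill names, the 2-d vectors, and the `TOY_RERANK` scores are hypothetical stand-ins for a trained embedding model and a trained cross-encoder.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query_vec: Vector,
    skill_vecs: Dict[str, Vector],
    rerank_score: Callable[[str], float],
    recall_k: int,
    top_k: int,
) -> List[str]:
    """Stage 1: recall by embedding similarity. Stage 2: reorder by reranker score."""
    recalled = sorted(
        skill_vecs,
        key=lambda s: cosine(query_vec, skill_vecs[s]),
        reverse=True,
    )[:recall_k]
    return sorted(recalled, key=rerank_score, reverse=True)[:top_k]


# Hypothetical toy data: three skills with 2-d embeddings.
SKILL_VECS = {
    "pdf_extract": [1.0, 0.0],
    "pdf_ocr": [0.9, 0.1],
    "sql_query": [0.0, 1.0],
}
# Hypothetical reranker scores for one fixed query; a real cross-encoder
# would compute these from the (query, skill) text pair.
TOY_RERANK = {"pdf_extract": 0.3, "pdf_ocr": 0.95, "sql_query": 0.99}

result = two_stage_retrieve(
    [1.0, 0.0], SKILL_VECS, lambda s: TOY_RERANK[s], recall_k=2, top_k=2
)
print(result)  # ['pdf_ocr', 'pdf_extract']
```

In this toy run the reranker reverses the order of the two recalled skills, while `sql_query` never reaches the reranker despite its high second-stage score. That split is why compatibility supervision can live entirely in the second stage, provided first-stage recall is good enough.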

If this is right

The bi-encoder stage can remain a standard relevance model while only the cross-encoder needs the extra compatibility objective.
Ablations on two datasets confirm that moving the rejection signal to the bi-encoder degrades performance.
The R3-Skill benchmark supplies query phrasings and multi-expert verification that match realistic agent routing scenarios across four language directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Collecting rejections at deployment time could create a continual training loop that refines the retriever without new human labels.
The same rejection-as-resource pattern may apply to other composition tasks such as tool selection or multi-document summarization where joint correctness matters.
Because the signal is query-conditional, the method could extend to dynamic skill sets that change over time without retraining the entire embedding space.

Load-bearing premise

The LLM's rejection decisions under a given query reliably indicate which skills are incompatible rather than reflecting unrelated model shortcomings.

What would settle it

A retriever trained on the same data but without any rejection-derived compatibility signal matching or exceeding the reported Hit@1 and Set-Compat scores on R3-Skill would falsify the value of the signal.

Figures

Figures reproduced from arXiv: 2606.03565 by Qiang Ji, Ruizhi Qiao, Wei Wen, Xing Sun, Zifei Wang.

**Figure 1.** Qwen3 tokenizer length distribution over the 10,246 skills in R3-Skill, computed on the concatenation of name + description + body. Mean = 2,073 tokens, p95 = 5,526, p99 = 9,983, max = 31,571; loading the entire library at once requires roughly 21.2M input tokens, and that is for skill text alone, before counting system prompts, user queries and LLM outputs, already far beyond the single-call context window…

**Figure 2.** Illustration of skill compatibility. Whether the same skill pair should be jointly retrieved can flip across queries, which is the core distinction between skill retrieval and document retrieval. means that even though every member is independently a plausible candidate, the set conflicts under query q (e.g. overlapping functionality, inconsistent style, or incompatible ecosystems) and should not be jointly retrieved;…

**Figure 3.** Overview of the construction pipeline, from raw collection to the final splits.

**Figure 4.** Bilateral balancing on the shared $e(s')$: a SKIP push under query $q$ is balanced by a positive pull under another query $\tilde{q}$; their equilibrium fixes the geometric position of $e(s')$. B.2 Derivation of Theorem 1: The InfoNCE loss on the candidate pool $C$ is $\mathcal{L}_{\mathrm{NCE}} = -\log p_{s^+}$. Among the logits $\{z_c\}_{c \in C}$, only $z_{s'}$ depends on $e(s')$, with $\partial z_{s'} / \partial e(s') = e(q)/\tau$. Differentiating yields $\nabla_{e(s')} \mathcal{L}^{(q)}_{\mathrm{NCE}} = -\,\partial\, l$…

**Figure 5.** Cluster quality vs. K (K ∈ {10, 20, 30, 40}). The three indicators are inertia (within-cluster sum of squared distances, lower is better), intra-cluster sim (within-cluster topical similarity, higher is better), and inter-centroid sim (similarity between adjacent centroids, lower is better); for joint comparison, all three curves are normalized to [0,1] and re-oriented so that “higher is better”. K = 40 is…

**Figure 6.** Hierarchical taxonomy view of the R3-Skill skill pool under K = 40 clustering. The inner ring shows the 8 super-domains (A–H), obtained by topic-merging the 40 sub-clusters; the outer ring shows the 40 sub-clusters, with sub-cluster names and arc lengths proportional to skill counts. Inner and outer rings sharing the same color family belong to the same super-domain. The figure exposes the hierarchical str…

**Figure 7.** LLM annotation flow: skill set → DECISION → WRITE branch generates queries in 6 styles × 4 directions / SKIP branch is archived → enters taxonomy classification and multi-expert verification.

**Figure 8.** Stacked bar chart of the 8 reject reason classes across the 4 language directions. en2en is dominated by domain_mismatch; in en2zh, forced_chaining jumps to 56.0%, so reject morphology differs structurally between Chinese and English.
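
The 21.2M-token figure in the Figure 1 caption can be checked from the two numbers the caption reports. The sketch below multiplies the skill count by the (rounded) mean length; the 128,000-token context window is an assumption chosen purely for illustration and does not come from the paper.

```python
import math

NUM_SKILLS = 10_246      # from the Figure 1 caption
MEAN_TOKENS = 2_073      # from the Figure 1 caption (rounded mean)

total_tokens = NUM_SKILLS * MEAN_TOKENS
print(f"{total_tokens:,} tokens ~= {total_tokens / 1e6:.1f}M")  # 21,239,958 tokens ~= 21.2M

# Assumption for illustration only: a 128,000-token context window.
ASSUMED_CONTEXT_WINDOW = 128_000
windows_needed = math.ceil(total_tokens / ASSUMED_CONTEXT_WINDOW)
print(f"Full library would fill about {windows_needed} such windows")
```

Because the mean is rounded, the product is an estimate of the true total, but it reproduces the caption's order of magnitude and shows why the library cannot be placed in a single prompt.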

Original abstract

LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top-K joint correctness depends not only on the semantic relevance of each individual query-skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Yet existing LLM-based data synthesis pipelines can produce a direct supervision signal for "which skills should not be jointly retrieved under this query" -- namely the LLM's own rejection decisions -- and this signal is routinely discarded as low-quality data. To address this gap, we propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark targeting realistic agent skill routing. R3-Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi-expert cross-checking. On R3-Skill, we build a two-stage retrieval system (R3-Embedding + R3-Reranker) with skill compatibility as an explicit training signal. Gradient analysis shows that the "push-away" signal is diluted by bilateral balancing in the bi-encoder but acts as lossless graded ranking supervision in the cross-encoder -- motivating its placement at the cross-encoder stage, as confirmed by ablations on two datasets. The R3-Embedding + R3-Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill. The dataset, training code and model weights are released as open source for agent skill routing.
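
For readers unfamiliar with the two standard metrics quoted in the abstract, the following sketch computes Hit@k and NDCG@k in their common binary-relevance form. It is an assumption that the paper uses this form (it may use graded gains), and Set-Compat is omitted because its definition is not given on this page; the skill identifiers are hypothetical.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant skill appears in the top k, else 0.0."""
    return 1.0 if any(s in relevant for s in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, s in enumerate(ranked[:k])
        if s in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: two gold skills, ranked second and third.
ranked = ["a", "b", "c"]
gold = {"b", "c"}
print(hit_at_k(ranked, gold, 1))             # 0.0
print(round(ndcg_at_k(ranked, gold, 3), 4))  # 0.6934
```

Both metrics score each skill independently of the others in the list, which is the limitation the paper's set-level compatibility metric is meant to address.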

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper turns LLM rejection decisions into explicit supervision for skill compatibility and releases a new benchmark plus code to support it.

This paper's main contribution is a benchmark called R3-Skill for query-conditional skill retrieval and a two-stage retriever that treats the LLM's rejection decisions as a resource rather than noise.

The authors construct the benchmark with realistic queries in Chinese and English, have it cross-checked by multiple experts, and release the dataset, code, and weights. The R3-Embedding plus R3-Reranker pipeline applies the rejection signal at the cross-encoder stage, because gradient analysis shows that bilateral balancing dilutes the push-away effect in the bi-encoder. Ablations on two datasets support this placement, and the pipeline reports Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill.

What stands out is the focus on joint compatibility instead of just individual relevance, which is a real difference from standard IR. The open sourcing makes it easy to build on. The analysis and ablations give some confidence that the approach is not just fitting noise.

A potential soft spot is whether the rejection signal remains useful when the downstream agent LLM differs from the one used to create the training data. The paper runs ablations on two datasets, but the benchmark itself is new, so broader validation would help. A Set-Compat of 0.3525 indicates that full joint correctness is still challenging.

This is aimed at people building LLM agents that need to route to multiple skills. Readers working on retrieval for agents would find the benchmark and the two-stage design useful.

I would recommend sending it for peer review. The new benchmark and the released materials make it worth a closer look even if some assumptions need more testing.

Referee Report

2 major / 2 minor

Summary. The paper claims that skill retrieval for LLM agents requires explicit modeling of query-conditional skill compatibility (beyond per-skill relevance), which can be supervised via LLM rejection decisions that are normally discarded. It introduces the bilingual R3-Skill benchmark (query phrasings close to real requests, multi-expert cross-checked) and the R3 two-stage retriever (R3-Embedding bi-encoder followed by R3-Reranker cross-encoder) trained with compatibility as an explicit signal. Gradient analysis shows the push-away gradient is diluted under bilateral balancing in the bi-encoder but remains lossless graded supervision for the cross-encoder; ablations on two datasets confirm the placement. The pipeline reports Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill, with dataset, code and weights released.

Significance. If the results and ablations hold, the work is significant for LLM agent systems: it formalizes a distinction between document and skill retrieval, repurposes a routinely discarded supervision signal, supplies a verified bilingual benchmark, and demonstrates a practical two-stage architecture with mechanistic justification. The open release of data, training code and weights directly supports reproducibility and follow-on work in agent skill routing.

major comments (2)

[Benchmark construction] Benchmark construction section: the claim of multi-expert cross-checking is central to treating R3-Skill as reliable ground truth, yet no inter-annotator agreement statistics, annotation protocol, or disagreement resolution procedure are provided; without these the reported metrics rest on an unquantified verification step.
[Gradient analysis / Ablations] Gradient analysis and ablations: the statement that the rejection signal supplies 'non-redundant' compatibility supervision (distinct from independent relevance) is load-bearing for the two-stage design; the manuscript must show an explicit control (e.g., training the reranker on relevance-only labels) to confirm the observed Hit@1 / NDCG@10 gains are not explainable by standard ranking objectives alone.

minor comments (2)

[Results] Results tables: report the full set of baselines (including relevance-only bi-encoder and cross-encoder variants) so that the incremental contribution of the rejection signal can be directly compared.
[Evaluation metrics] Metric definitions: provide the exact formula or reference for Set-Compat; if it is a custom set-level compatibility measure it should be defined in the main text rather than assumed known.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback. We address each major comment below and will incorporate the suggested revisions.

Referee: [Benchmark construction] Benchmark construction section: the claim of multi-expert cross-checking is central to treating R3-Skill as reliable ground truth, yet no inter-annotator agreement statistics, annotation protocol, or disagreement resolution procedure are provided; without these the reported metrics rest on an unquantified verification step.

Authors: We agree that quantitative validation of the annotation process is necessary to support the benchmark's reliability. In the revised manuscript we will add a subsection detailing the annotation protocol, the number of experts, inter-annotator agreement statistics (including percentage agreement and Cohen's kappa), and the disagreement resolution procedure. revision: yes

Referee: [Gradient analysis / Ablations] Gradient analysis and ablations: the statement that the rejection signal supplies 'non-redundant' compatibility supervision (distinct from independent relevance) is load-bearing for the two-stage design; the manuscript must show an explicit control (e.g., training the reranker on relevance-only labels) to confirm the observed Hit@1 / NDCG@10 gains are not explainable by standard ranking objectives alone.

Authors: While the existing ablations on two datasets already demonstrate the benefit of placing the compatibility signal at the cross-encoder, we acknowledge that an explicit relevance-only control would strengthen the claim of non-redundancy. We will add this control experiment (training the reranker without the rejection signal) and report the resulting Hit@1 and NDCG@10 metrics in the revision. revision: yes
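
The two agreement statistics promised in the first response are straightforward to compute. The sketch below assumes two annotators labelling the same items with WRITE/SKIP decisions; the labels and the example data are hypothetical, and the paper's actual annotation protocol is not described on this page.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which the two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each annotator's marginal label frequencies.
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels from two annotators on four skill sets.
annotator_1 = ["WRITE", "WRITE", "SKIP", "SKIP"]
annotator_2 = ["WRITE", "SKIP", "SKIP", "SKIP"]
print(percent_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))       # 0.5
```

Cohen's kappa is defined for exactly two raters; with more than two experts, a multi-rater statistic such as Fleiss' kappa would be the usual choice.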

Circularity Check

0 steps flagged

No significant circularity detected

The derivation relies on an externally generated supervision signal (LLM rejection decisions) that the paper explicitly argues cannot be derived from independent relevance scores, a new benchmark constructed and verified via multi-expert cross-checking, gradient analysis demonstrating differential behavior of the signal in bi- versus cross-encoders, and ablations isolating its contribution on two datasets. None of these elements reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the performance numbers are presented as empirical outcomes of the proposed pipeline rather than tautological restatements of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information on free parameters, axioms, or invented entities is available from the abstract alone.


