FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

Rongjun Li; Yihang Wu; Ziyu Zhou

arxiv: 2605.22057 · v1 · pith:DHKLRZJVnew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-22 07:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords: self-evolving profiling, data flywheel, task routing, LLM router, agent capability, adaptive routing, enterprise support, targeted exploration

The pith

FlyRoute evolves agent profiles from real traffic to raise LLM router accuracy from 72.57% to 89.83%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise routers must send queries to the right expert agents, but static agent descriptions quickly fall out of date as prompts, tools, and models change. FlyRoute runs a data flywheel that dispatches incoming queries to candidate agents, keeps only the query-agent pairs that pass a quality check, distills those pairs into updated capability descriptions, and feeds the descriptions plus retrieved examples back into the router. A targeted exploration policy focuses effort on under-profiled agents and on novel or uncertain queries so that evidence is collected efficiently. The approach matters because it reduces the need for constant manual profile updates while lifting routing accuracy in settings where each query must reach the correct specialist.

Core claim

FlyRoute is a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To keep the flywheel data-efficient, a targeted exploration policy combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection.
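The dispatch, gate, and distill stages described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `AgentProfile`, `flywheel_step`, `distill_profiles`, and the three toy stand-ins are hypothetical names, and the default threshold of 0.70 is taken from the judge-prompt excerpt quoted in the reference list, where it is described only as typical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentProfile:
    seed_description: str            # text supplied at registration
    learned_description: str = ""    # filled in by distillation
    successes: List[Tuple[str, str]] = field(default_factory=list)  # gated (query, response) pairs


def flywheel_step(query: str,
                  candidates: List[str],
                  profiles: Dict[str, AgentProfile],
                  run_agent: Callable[[str, str], str],
                  judge: Callable[[str, str], float],
                  theta: float = 0.70) -> List[str]:
    """Dispatch one query to candidate agents and cache pairs that pass the gate."""
    accepted = []
    for name in candidates:
        response = run_agent(name, query)
        if judge(query, response) >= theta:  # quality gate
            profiles[name].successes.append((query, response))
            accepted.append(name)
    return accepted


def distill_profiles(profiles: Dict[str, AgentProfile],
                     distill: Callable[[AgentProfile], str]) -> None:
    """Turn each non-empty success store into a learned description."""
    for profile in profiles.values():
        if profile.successes:
            profile.learned_description = distill(profile)


# Toy stand-ins (hypothetical); real versions would call agents and an LLM.
def toy_agent(name: str, query: str) -> str:
    return f"[{name}] answer to: {query}"


def toy_judge(query: str, response: str) -> float:
    # Pretend only the Mobile OS agent answers ArkUI questions well.
    return 0.9 if "ArkUI" in query and "Mobile OS" in response else 0.2


def toy_distill(profile: AgentProfile) -> str:
    return f"{profile.seed_description}; {len(profile.successes)} verified success(es)"


profiles = {
    "Mobile OS": AgentProfile("HarmonyOS, ArkTS, ArkUI"),
    "Cloud Services": AgentProfile("Cloud product resources and tools"),
}
accepted = flywheel_step("How do I lay out an ArkUI list?",
                         list(profiles), profiles, toy_agent, toy_judge)
distill_profiles(profiles, toy_distill)
print(accepted, profiles["Mobile OS"].learned_description)
```

Passing the agent runner, judge, and distiller as callables keeps the loop itself independent of any particular model or prompt, which is one plausible way to let agents change underneath the router.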

What carries the argument

The data flywheel that dispatches queries to candidate agents, quality-gates successful pairs into per-agent success stores, distills them into capability descriptions, and retrieves those descriptions plus BM25 examples for the router.
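For readers unfamiliar with the retrieval step, the following sketch scores a per-agent success store against a query with Okapi BM25 and returns the best matches. It is an illustration under stated assumptions: whitespace tokenization (the deployed system is Chinese and would need a proper tokenizer), the common defaults `k1=1.5` and `b=0.75`, and made-up example queries. The paper's actual BM25 configuration is not given in this summary.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokens, for illustration only.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[int, float]]:
    """Indices and scores of the k best-matching documents with a positive score."""
    ranked = sorted(enumerate(bm25_scores(query, docs)), key=lambda p: p[1], reverse=True)
    return [(i, s) for i, s in ranked[:k] if s > 0]


# Hypothetical success store for one agent.
success_store = [
    "how to configure ArkUI list component",
    "deploy MindSpore model on Ascend",
    "ArkTS async task example",
]
scores = bm25_scores("ArkUI component layout", success_store)
hits = top_k("ArkUI component layout", success_store)
print(hits)
```

Only stored queries that share at least one term with the incoming query receive a positive score, so lexically unrelated successes are never injected into the router prompt.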

If this is right

Five seed queries per agent already raise accuracy from 72.57% to 78.04%.
After streaming 7,211 labeled queries the accuracy reaches 89.83%, a 17.26-point gain over zero-shot and 11.79 points over the cold-start stage.
Gains remain consistent across four expert domains when measured by standard routing accuracy on single-gold test queries.
Simply adding profile retrieval from the success stores already strengthens performance in the cold-start phase.
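The gains quoted in the list above are plain differences in percentage points, and a short check confirms they are mutually consistent. The relative error reduction computed at the end is not a figure the paper reports; it is derived here only to put the point gains in context.

```python
# Accuracies reported in the abstract, in percent.
zero_shot, cold_start, full = 72.57, 78.04, 89.83

gain_cold_vs_zero = round(cold_start - zero_shot, 2)  # five seed queries per agent
gain_full_vs_zero = round(full - zero_shot, 2)        # after 7,211 streamed queries
gain_full_vs_cold = round(full - cold_start, 2)

# Derived illustration (not stated in the paper): the share of zero-shot
# routing errors that the full flywheel removes.
error_reduction = (full - zero_shot) / (100.0 - zero_shot)

print(gain_cold_vs_zero, gain_full_vs_zero, gain_full_vs_cold)
print(f"{error_reduction:.1%}")
```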

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Companies could run the flywheel continuously so that routing stays accurate as agents receive new tools or updated prompts without extra human effort.
The same loop of dispatch, gate, distill, and retrieve might apply to routing among changing tools or models in general multi-agent assistants.
Testing the flywheel on public multi-domain routing benchmarks would show whether the accuracy gains transfer beyond the proprietary developer-support dataset.

Load-bearing premise

The quality-gating step that decides which query-agent pairs count as successes is reliable and does not systematically favor certain agents or query types.

What would settle it

Re-running the router on the held-out test set after the flywheel has streamed the 7,211 training queries and finding accuracy no higher than the original 72.57% zero-shot baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22057 by Rongjun Li, Yihang Wu, Ziyu Zhou.

**Figure 2.** Prompt for FlyRoute. Notice that the live system prompt uses Chinese-language tokens; English labels shown here for readability.

**Figure 3.** Distillation prompt for FlyRoute. Notice that the live template is in Chinese; we translate it to English here for readability.

**Figure 4.** LLM-as-Judge prompts for FlyRoute. The deployed templates are Chinese; we convert them to English ...

**Figure 5.** FlyRoute learned descriptions. The deployed templates are Chinese; we convert them to English for ...

Original abstract

Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlyRoute gives a workable flywheel for updating agent profiles from real traffic with targeted exploration, but the accuracy gains rest on unexamined selection effects in gating and evidence collection.

The main thing here is a concrete loop that turns live queries into better router profiles without manual upkeep. It dispatches candidates, quality-gates successes into per-agent stores, distills them into capability descriptions, and injects those plus BM25-retrieved examples into the LLM router prompt. The targeted exploration policy mixes profile uncertainty, BM25 relevance, and lexical novelty to focus on under-profiled agents only for plausible queries. That combination is what feels new rather than a straight lift from prior routing or continual-learning papers.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlyRoute, a self-evolving profiling framework for routing queries to expert agents in enterprise settings. It operates a data flywheel that dispatches queries to candidate agents, quality-gates successful query-agent pairs into per-agent success stores, periodically distills the evidence into learned capability descriptions, and augments an LLM router with these descriptions plus BM25-retrieved exemplars. A targeted exploration policy combining profile uncertainty, BM25 relevance, and lexical novelty is used to collect evidence efficiently. On a proprietary enterprise developer-support dataset, the method is reported to raise same-backbone LLM router accuracy from 72.57% (zero-shot) to 78.04% using only five seed queries per agent, and further to 89.83% after streaming 7,211 labeled training queries (+17.26 pp over zero-shot, +11.79 pp over cold-start), with gains across four domains under single-gold routing accuracy.

Significance. If the empirical claims hold after addressing the noted gaps, FlyRoute provides a practical, low-manual-effort mechanism for keeping agent profiles current as agents evolve, which is a common pain point in production multi-agent routers. The data-flywheel design and targeted exploration policy are concrete contributions that could reduce profiling cost; the reported lift from a small seed set already demonstrates value even before large-scale traffic is processed. Reproducible code or public data would further increase impact.

major comments (2)

[Abstract and §4 (Experiments)] The central accuracy claims (72.57% zero-shot, 78.04% with five seeds, 89.83% after 7,211 queries) are presented as point estimates with no error bars, no mention of test-set size or variance across runs, and no ablation isolating the contribution of quality gating versus distillation versus BM25 retrieval. Because these numbers are the primary evidence for the flywheel's effectiveness, the absence of statistical characterization and component ablations makes it impossible to judge whether the +11.79 pp gain is robust or partly artifactual.
[§3 (Flywheel and Exploration Policy)] The quality-gating step that populates the success stores is load-bearing for the unbiased-evidence assumption, yet its concrete criteria, thresholds, or LLM prompt are not specified. The targeted exploration policy explicitly selects “plausible queries” for under-profiled agents; without a distribution-shift diagnostic (e.g., lexical or embedding distance between gated successes and the held-out test queries), it remains possible that the distilled descriptions and retrieved exemplars over-represent easier or more representative query types, directly inflating the reported router accuracy.

minor comments (2)

[§3.3] The free parameters of the exploration policy (combination weights) are mentioned but not given explicit values or sensitivity analysis; a short table or paragraph stating the values used would improve reproducibility.
[§4] Figure captions and axis labels in the experimental plots should explicitly state the number of test queries and whether the plotted points are single runs or averages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's thoughtful review. We appreciate the feedback on strengthening the empirical evaluation and clarifying the methodology. We address each major comment below and outline the revisions.

Referee: [Abstract and §4 (Experiments)] The central accuracy claims (72.57% zero-shot, 78.04% with five seeds, 89.83% after 7,211 queries) are presented as point estimates with no error bars, no mention of test-set size or variance across runs, and no ablation isolating the contribution of quality gating versus distillation versus BM25 retrieval. Because these numbers are the primary evidence for the flywheel's effectiveness, the absence of statistical characterization and component ablations makes it impossible to judge whether the +11.79 pp gain is robust or partly artifactual.

Authors: We agree that providing error bars, test-set details, and component ablations would improve the robustness assessment of our results. In the revised version, we will specify the test-set size and add bootstrap confidence intervals to the reported accuracies. We will also include ablations to isolate the effects of quality gating, distillation, and BM25 retrieval. While the proprietary and streaming nature of the dataset limits our ability to perform multiple independent full flywheel runs for variance estimation, we will provide the statistical measures that are feasible.
Revision: yes

Referee: [§3 (Flywheel and Exploration Policy)] The quality-gating step that populates the success stores is load-bearing for the unbiased-evidence assumption, yet its concrete criteria, thresholds, or LLM prompt are not specified. The targeted exploration policy explicitly selects “plausible queries” for under-profiled agents; without a distribution-shift diagnostic (e.g., lexical or embedding distance between gated successes and the held-out test queries), it remains possible that the distilled descriptions and retrieved exemplars over-represent easier or more representative query types, directly inflating the reported router accuracy.

Authors: We thank the referee for highlighting the need for greater transparency in the quality-gating process. We will revise §3 to explicitly describe the concrete criteria, thresholds, and the LLM prompt employed for quality gating. To address the potential distribution shift, we will incorporate a diagnostic analysis comparing lexical overlap and embedding similarities between the gated successful queries and the held-out test set, ensuring that the evidence collection does not bias toward simpler queries.
Revision: yes
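The bootstrap confidence intervals promised in the first response can be computed from per-query correctness alone, without rerunning the flywheel. The sketch below uses a percentile bootstrap; the test-set size of 1,000 and the resulting 89.8% accuracy are hypothetical placeholders, since the paper's test-set size is not reported in this summary, and `bootstrap_accuracy_ci` is an illustrative name.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(correct: List[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Point accuracy plus a percentile-bootstrap (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample queries with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Hypothetical outcome vector: 1 = routed to the gold agent, 0 = misrouted.
correct = [1] * 898 + [0] * 102
point, lower, upper = bootstrap_accuracy_ci(correct)
print(f"accuracy={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

An interval of this kind captures sampling noise in the test set only; it says nothing about variance across independent flywheel runs, which the authors note they may not be able to estimate.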

Circularity Check

0 steps flagged

No significant circularity; empirical results on held-out test set

The paper presents FlyRoute as a procedural framework that streams real queries through dispatch, quality-gating, distillation into capability descriptions, and BM25 retrieval for an LLM router, then reports measured accuracy gains on held-out single-gold test queries (72.57% zero-shot to 89.83% after 7,211 queries). No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the accuracy figures are external evaluation outcomes rather than predictions forced by the paper's own equations or ansatzes. The derivation chain is therefore self-contained against the described dataset and test protocol.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that success pairs can be reliably identified and that the exploration policy gathers unbiased evidence; no free parameters are explicitly named but the exploration weights are likely tuned.

free parameters (1)

exploration policy combination weights
The policy blends profile uncertainty, BM25 relevance, and lexical novelty; the relative weighting is not stated as derived from first principles.
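To make the role of these weights concrete, here is one way the three signals could be combined. Every functional form below is an assumption: the passage quoted elsewhere on this page shows only a product of an uncertainty term and a relevance term before it is cut off, so the inverse-square-root uncertainty, the Jaccard-based novelty, the `min_relevance` plausibility cutoff, and the `novelty_weight` exponent are hypothetical choices, not the paper's definitions.

```python
import math
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_uncertainty(n_successes: int) -> float:
    # Hypothetical form: uncertainty shrinks as evidence accumulates.
    return 1.0 / math.sqrt(1.0 + n_successes)


def lexical_novelty(query: str, stored: List[str]) -> float:
    # Hypothetical form: 1 minus the highest token overlap with any stored query.
    q = set(query.lower().split())
    if not stored or not q:
        return 1.0
    return 1.0 - max(jaccard(q, set(s.lower().split())) for s in stored)


def explore_value(query: str, stored: List[str], relevance: float,
                  min_relevance: float = 0.2, novelty_weight: float = 1.0) -> float:
    """Exploration value for one (query, agent) pair; relevance is assumed to be in [0, 1]."""
    if relevance < min_relevance:  # implausible query for this agent: do not explore
        return 0.0
    u = profile_uncertainty(len(stored))
    nov = lexical_novelty(query, stored)
    return u * relevance * (nov ** novelty_weight)


query = "ArkUI list layout"
v_new = explore_value(query, stored=[], relevance=0.8)            # under-profiled agent
v_dup = explore_value(query, stored=[query] * 99, relevance=0.8)  # redundant evidence
v_implausible = explore_value(query, stored=[], relevance=0.05)   # fails plausibility
print(v_new, v_dup, v_implausible)
```

Under these assumptions an agent with an empty store and a plausible query gets the highest value, while a query already present in a large store scores zero, which mirrors the stated goal of avoiding redundant evidence collection.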

axioms (1)

Domain assumption: the quality gate correctly labels successful query-agent pairs without systematic bias.
The flywheel depends on this gate to populate the success stores that later become the learned profiles.
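The gate itself reduces to parsing the judge's output and comparing a score against a threshold. The sketch below assumes what the judge-prompt excerpt in the reference list suggests: a single JSON object with a mandatory `quality_score` key and a typical threshold of 0.70. Treating the score as a number in [0, 1] and rejecting malformed output are assumptions, since the excerpt is truncated; the function names are illustrative.

```python
import json
from typing import Optional

THETA = 0.70  # typical threshold quoted in the judge-prompt excerpt


def parse_quality_score(raw: str) -> Optional[float]:
    """Return the judge score, or None if the output is malformed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    score = obj.get("quality_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:  # assumed score range
        return None
    return float(score)


def passes_gate(raw: str, theta: float = THETA) -> bool:
    """True only when a well-formed score reaches the threshold."""
    score = parse_quality_score(raw)
    return score is not None and score >= theta


print(passes_gate('{"quality_score": 0.82}'))  # well above the threshold
print(passes_gate('{"quality_score": 0.35}'))  # the excerpt's cap for refusals
print(passes_gate('not json'))                 # malformed output is rejected
```

Failing closed on malformed judge output keeps unverified pairs out of the success store, at the cost of discarding some usable evidence.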

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: FlyRoute introduces an uncertainty-driven exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty... V_explore(q, a_i) = U(a_i) · R(q, a_i) ... quality gate decides which query–agent pairs enter the success store

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: periodically distill the accumulated evidence into a refined capability description... after streaming 7,211 labeled training queries

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] MoDEM: Mixture of domain expert models. Preprint, arXiv:2410.07490. Charles Tran, Sarim Paracha, Abdul Hafeez, and Shiyi Chen. 2025. Arch-router: Aligning LLM routing with human preferences. Preprint, arXiv:2506.16655. Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, and Shuyue Hu. 2026. ICL-Router: In-con...

[2] AutoGen: Enabling next-gen LLM applications via multi-agent conversation. Preprint, arXiv:2308.08155. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations. Tony Zhang, Ali Mehradfar, Dimitris Dimi...

[6] Mobile OS: HarmonyOS, Ability Kit, ArkTS, ArkUI, HarmonyOS Next, etc. [Agent profile summary] Below are global capability statements for each route (seed text at registration; after distillation, summaries induced from historical successes, to capture shared competence beyond the retrieved examples): {profile_block} [BM25 retrieval examples] (successful qu...

[7] Output exactly one agent dispatch per query: choose one of Cloud Services, AI Accelerator, Server Hardware, Mobile OS defined in the deployment template

[8] Output route names only; no explanation or extra text. Figure 2: Prompt for FlyRoute. Notice that the live system prompt uses Chinese-language tokens; English labels shown here for readability. FlyRoute capability distillation, user prompt (English translation): You are an expert at analyzing AI agent capabilities. From the information below, produce a concis...

[9] Ground the summary only in queries that actually succeeded; do not fabricate capabilities

[10] If empirical behavior differs from the initial description or the prior distilled summary, follow the newest evidence; retain prior wording that still holds and revise only when evidence conflicts. 3. Keep the description concise: ≤500 characters in the deployed Chinese template (We state these rules in English for readability)

[11] (success-store) If no prior distilled description exists, {prev_learned_block} is omitted. [Output format] <Description>... capability text (within the character cap) ...</Description> Figure 3: Distillation prompt for FlyRoute. Notice that the live template is in Chinese; we translate it to English here for readability. FlyRoute LLM-as-Judge prompts (English transla...

[12] Cloud Services: Cloud product resources, tools, services, specifications, purchase, documentation, etc

[13] AI Accelerator: Ascend stack, CANN, MindSpore, operators, model training and inference tooling, etc

[14] Server Hardware: Kunpeng, openEuler, DevKit, BoostKit, HPC, acceleration libraries, storage, etc

[15] Mobile OS: HarmonyOS, Ability Kit, ArkTS, ArkUI, HarmonyOS Next, etc. [Scoring principles]

[16] r must address the technical subject of q; off-topic chatter, unrelated marketing, or empty acknowledgments lowers the score

[17] If r sprawls largely outside those four competencies with weak grounding, penalize under “off-topic/low applicability.”

[18] Responses that are blank, boilerplate refusals, or contain no substantive guidance should score ≤0.35

[19] Partially helpful but misses the crux earns mid-range scores; only concrete, actionable, mostly non-contradictory answers earn high scores (watch for hallucination)

[20] The FlyRoute gate typically applies θ≈0.70: slightly beyond “barely usable” denotes evidence worth caching. [Optional calibration block] Few-shot excerpts can be inlined here via {examples_optional} when enabled in code. [Output schema] Return a single JSON object. Use ASCII quotes only; omit Markdown wrappers. The mandatory key is quality_score, whose value...
