FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing
Pith reviewed 2026-05-22 07:04 UTC · model grok-4.3
The pith
FlyRoute evolves agent profiles from real traffic to raise LLM router accuracy from 72.57% to 89.83%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlyRoute is a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To keep the flywheel data-efficient, a targeted exploration policy combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection.
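The exploration policy described above can be sketched as a single score per (query, agent) pair. This is a minimal illustration, not the paper's formula: the page only names the three ingredients, so the product form, the `1 / (1 + n)` uncertainty term, the Jaccard-based novelty term, and the exponent weights `w_u`, `w_r`, `w_n` are all assumptions, and every function name is hypothetical. `relevance` stands in for a BM25 score already normalized to [0, 1].

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def profile_uncertainty(n_successes: int) -> float:
    """Assumed form: uncertainty shrinks as an agent's success store grows."""
    return 1.0 / (1.0 + n_successes)

def lexical_novelty(query: str, stored_queries: list[str]) -> float:
    """Assumed form: 1 minus the similarity to the closest stored query."""
    if not stored_queries:
        return 1.0
    return 1.0 - max(jaccard(query, s) for s in stored_queries)

def exploration_score(query: str, stored_queries: list[str], relevance: float,
                      w_u: float = 1.0, w_r: float = 1.0, w_n: float = 1.0) -> float:
    """Weighted product of uncertainty, relevance (in [0, 1]), and novelty."""
    u = profile_uncertainty(len(stored_queries))
    n = lexical_novelty(query, stored_queries)
    return (u ** w_u) * (relevance ** w_r) * (n ** w_n)
```

A product was chosen for the sketch because any factor at zero vetoes exploration: an implausible query (zero relevance) or an exact repeat of stored evidence (zero novelty) scores zero regardless of how under-profiled the agent is.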
What carries the argument
The data flywheel that dispatches queries to candidate agents, quality-gates successful pairs into per-agent success stores, distills them into capability descriptions, and retrieves those descriptions plus BM25 examples for the router.
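The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `respond`, `judge`, and `distill` are hypothetical caller-supplied callables, `distill_every` is an invented schedule, and dispatching every query to every agent is a simplification of the targeted exploration policy. The default threshold of 0.70 follows the judge-prompt excerpt reproduced further down this page.

```python
from collections import defaultdict

def run_flywheel(queries, agents, respond, judge, distill,
                 theta=0.70, distill_every=100):
    """One pass of dispatch -> quality gate -> success store -> distill.

    respond(agent, query) -> str
    judge(query, response) -> float in [0, 1]
    distill(agent, successes) -> str (a capability description)
    """
    success_store = defaultdict(list)  # agent -> gated (query, response) pairs
    profiles = {}                      # agent -> learned capability description
    for step, query in enumerate(queries, start=1):
        for agent in agents:           # simplification: dispatch to all agents
            response = respond(agent, query)
            if judge(query, response) >= theta:  # quality gate
                success_store[agent].append((query, response))
        if step % distill_every == 0:  # periodic distillation
            for agent in agents:
                if success_store[agent]:
                    profiles[agent] = distill(agent, success_store[agent])
    return success_store, profiles
```

The returned `profiles` and `success_store` are the two inputs the router consumes: the former as capability descriptions, the latter as the pool from which BM25 examples are retrieved.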
If this is right
- Five seed queries per agent already raise accuracy from 72.57% to 78.04%.
- After streaming 7,211 labeled queries the accuracy reaches 89.83%, a 17.26-point gain over zero-shot and 11.79 points over the cold-start stage.
- Gains remain consistent across four expert domains when measured by standard routing accuracy on single-gold test queries.
- Simply adding profile retrieval from the success stores already strengthens performance in the cold-start phase.
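The gains quoted in the list above follow directly from the three reported accuracies. A short Python check (variable names are illustrative) reproduces them; the relative error reduction on the last line is derived here from those numbers and is not a figure stated in the paper.

```python
zero_shot, cold_start, streamed = 72.57, 78.04, 89.83  # accuracies in percent

seed_gain = round(cold_start - zero_shot, 2)      # five seeds vs. zero-shot
total_gain = round(streamed - zero_shot, 2)       # full flywheel vs. zero-shot
flywheel_gain = round(streamed - cold_start, 2)   # full flywheel vs. cold start

# Share of the zero-shot error rate removed by the full flywheel (derived).
error_reduction = (streamed - zero_shot) / (100 - zero_shot)

print(seed_gain, total_gain, flywheel_gain, round(error_reduction, 3))
```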
Where Pith is reading between the lines
- Companies could run the flywheel continuously so that routing stays accurate as agents receive new tools or updated prompts without extra human effort.
- The same loop of dispatch, gate, distill, and retrieve might apply to routing among changing tools or models in general multi-agent assistants.
- Testing the flywheel on public multi-domain routing benchmarks would show whether the accuracy gains transfer beyond the proprietary developer-support dataset.
Load-bearing premise
The quality-gating step that decides which query-agent pairs count as successes is reliable and does not systematically favor certain agents or query types.
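One simple way to probe this premise is to compare how often the gate accepts each agent's responses. The sketch below assumes a hypothetical gate log of `(agent, accepted)` records; the log format and function name are not from the paper. A large gap between agents is only a prompt for closer inspection, since agents can genuinely differ in answer quality.

```python
from collections import Counter

def acceptance_rates(gate_log):
    """Per-agent fraction of dispatched responses that passed the quality gate.

    gate_log: iterable of (agent_name, accepted) pairs, accepted being a bool.
    """
    seen, accepted = Counter(), Counter()
    for agent, ok in gate_log:
        seen[agent] += 1
        accepted[agent] += int(ok)
    return {agent: accepted[agent] / seen[agent] for agent in seen}
```

The same tally can be repeated per query type (for example, by domain label) to check the second half of the premise.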
What would settle it
Re-running the router on the held-out test set after the flywheel has streamed the 7,211 training queries and finding accuracy no higher than the original 72.57% zero-shot baseline would falsify the central claim.
Original abstract
Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.
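The retrieval step mentioned in the abstract can be illustrated with a small pure-Python BM25 ranker over one agent's stored successful queries. This is a sketch under assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`. The paper's tokenizer (the live prompts are in Chinese) and its BM25 parameters are not given on this page, and the function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document in `docs`."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k_successes(query, successes, k=3):
    """Return up to k stored successes with a positive BM25 score, best first."""
    scores = bm25_scores(query, successes)
    ranked = sorted(zip(scores, successes), key=lambda p: p[0], reverse=True)
    return [text for score, text in ranked[:k] if score > 0]
```

The retrieved strings would then be placed in the router prompt next to the distilled capability descriptions.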
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlyRoute, a self-evolving profiling framework for routing queries to expert agents in enterprise settings. It operates a data flywheel that dispatches candidate queries, quality-gates successful query-agent pairs into per-agent success stores, periodically distills the evidence into learned capability descriptions, and augments an LLM router with these descriptions plus BM25-retrieved exemplars. A targeted exploration policy combining profile uncertainty, BM25 relevance, and lexical novelty is used to collect evidence efficiently. On a proprietary enterprise developer-support dataset, the method is reported to raise same-backbone LLM router accuracy from 72.57% (zero-shot) to 78.04% using only five seed queries per agent, and further to 89.83% after streaming 7,211 labeled training queries (+17.26 pp over zero-shot, +11.79 pp over cold-start), with gains across four domains under single-gold routing accuracy.
Significance. If the empirical claims hold after addressing the noted gaps, FlyRoute provides a practical, low-manual-effort mechanism for keeping agent profiles current as agents evolve, which is a common pain point in production multi-agent routers. The data-flywheel design and targeted exploration policy are concrete contributions that could reduce profiling cost; the reported lift from a small seed set already demonstrates value even before large-scale traffic is processed. Reproducible code or public data would further increase impact.
major comments (2)
- [Abstract and §4 (Experiments)] The central accuracy claims (72.57% zero-shot, 78.04% with five seeds, 89.83% after 7,211 queries) are presented as point estimates with no error bars, no mention of test-set size or variance across runs, and no ablation isolating the contribution of quality gating versus distillation versus BM25 retrieval. Because these numbers are the primary evidence for the flywheel's effectiveness, the absence of statistical characterization and component ablations makes it impossible to judge whether the +11.79 pp gain is robust or partly artifactual.
- [§3 (Flywheel and Exploration Policy)] The quality-gating step that populates the success stores is load-bearing for the unbiased-evidence assumption, yet its concrete criteria, thresholds, or LLM prompt are not specified. The targeted exploration policy explicitly selects “plausible queries” for under-profiled agents; without a distribution-shift diagnostic (e.g., lexical or embedding distance between gated successes and the held-out test queries), it remains possible that the distilled descriptions and retrieved exemplars over-represent easier or more representative query types, directly inflating the reported router accuracy.
minor comments (2)
- [§3.3] The free parameters of the exploration policy (combination weights) are mentioned but not given explicit values or sensitivity analysis; a short table or paragraph stating the values used would improve reproducibility.
- [§4] Figure captions and axis labels in the experimental plots should explicitly state the number of test queries and whether the plotted points are single runs or averages.
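The statistical characterization requested in the first major comment could take the form of a percentile bootstrap over per-query correctness on the test set. The sketch below is illustrative only: the test-set size is not reported on this page, so any input is hypothetical, and the function name, resample count, and seed are assumptions.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for routing accuracy.

    correct: list of 0/1 flags, one per test query (1 = routed to the gold agent).
    Returns (accuracy, lower, upper).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper
```

This captures sampling variability of the test set only; variance across independent flywheel runs, which the referee also asks about, would need repeated runs.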
Simulated Author's Rebuttal
Thank you for the referee's thoughtful review. We appreciate the feedback on strengthening the empirical evaluation and clarifying the methodology. We address each major comment below and outline the revisions.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The central accuracy claims (72.57% zero-shot, 78.04% with five seeds, 89.83% after 7,211 queries) are presented as point estimates with no error bars, no mention of test-set size or variance across runs, and no ablation isolating the contribution of quality gating versus distillation versus BM25 retrieval. Because these numbers are the primary evidence for the flywheel's effectiveness, the absence of statistical characterization and component ablations makes it impossible to judge whether the +11.79 pp gain is robust or partly artifactual.
  Authors: We agree that providing error bars, test-set details, and component ablations would improve the robustness assessment of our results. In the revised version, we will specify the test-set size and add bootstrap confidence intervals to the reported accuracies. We will also include ablations to isolate the effects of quality gating, distillation, and BM25 retrieval. While the proprietary and streaming nature of the dataset limits our ability to perform multiple independent full flywheel runs for variance estimation, we will provide the statistical measures that are feasible.
  Revision: yes
- Referee: [§3 (Flywheel and Exploration Policy)] The quality-gating step that populates the success stores is load-bearing for the unbiased-evidence assumption, yet its concrete criteria, thresholds, or LLM prompt are not specified. The targeted exploration policy explicitly selects “plausible queries” for under-profiled agents; without a distribution-shift diagnostic (e.g., lexical or embedding distance between gated successes and the held-out test queries), it remains possible that the distilled descriptions and retrieved exemplars over-represent easier or more representative query types, directly inflating the reported router accuracy.
  Authors: We thank the referee for highlighting the need for greater transparency in the quality-gating process. We will revise §3 to explicitly describe the concrete criteria, thresholds, and the LLM prompt employed for quality gating. To address the potential distribution shift, we will incorporate a diagnostic analysis comparing lexical overlap and embedding similarities between the gated successful queries and the held-out test set, ensuring that the evidence collection does not bias toward simpler queries.
  Revision: yes
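The lexical half of the diagnostic discussed in this exchange could be as simple as measuring, for each test query, its token overlap with the closest gated success. The sketch below uses token-set Jaccard similarity on whitespace tokens; the metric choice and function names are assumptions, not the analysis the authors commit to.

```python
def token_set(text):
    return set(text.lower().split())

def nearest_overlap(query, pool):
    """Highest Jaccard similarity between `query` and any string in `pool`."""
    q = token_set(query)
    best = 0.0
    for candidate in pool:
        c = token_set(candidate)
        union = q | c
        if union:
            best = max(best, len(q & c) / len(union))
    return best

def mean_nearest_overlap(test_queries, gated_queries):
    """Average nearest-neighbour overlap of the test set against gated successes."""
    if not test_queries:
        return 0.0
    total = sum(nearest_overlap(q, gated_queries) for q in test_queries)
    return total / len(test_queries)
```

Splitting test queries by whether the router answered them correctly and comparing the two means would indicate whether accuracy is concentrated on queries that closely resemble stored evidence.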
Circularity Check
No significant circularity; empirical results on held-out test set
Full rationale
The paper presents FlyRoute as a procedural framework that streams real queries through dispatch, quality-gating, distillation into capability descriptions, and BM25 retrieval for an LLM router, then reports measured accuracy gains on held-out single-gold test queries (72.57% zero-shot to 89.83% after 7,211 queries). No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the accuracy figures are external evaluation outcomes rather than predictions forced by the paper's own equations or ansatzes. The derivation chain is therefore self-contained against the described dataset and test protocol.
Axiom & Free-Parameter Ledger
free parameters (1)
- exploration policy combination weights
axioms (1)
- domain assumption: the quality gate correctly labels successful query-agent pairs without systematic bias
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: FlyRoute introduces an uncertainty-driven exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty... V_explore(q, a_i) = U(a_i) · R(q, a_i) ... quality gate decides which query–agent pairs enter the success store
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: periodically distill the accumulated evidence into a refined capability description... after streaming 7,211 labeled training queries
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
MoDEM: Mixture of domain expert models. Preprint, arXiv:2410.07490. Charles Tran, Sarim Paracha, Abdul Hafeez, and Shiyi Chen. 2025. Arch-router: Aligning LLM routing with human preferences. Preprint, arXiv:2506.16655. Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, and Shuyue Hu. 2026. ICL-Router: In-con...
-
[2]
AutoGen: Enabling next-gen LLM applications via multi-agent conversation. Preprint, arXiv:2308.08155. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations. Tony Zhang, Ali Mehradfar, Dimitris Dimi...
-
[6]
Mobile OS: HarmonyOS, Ability Kit, ArkTS, ArkUI, HarmonyOS Next, etc. [Agent profile summary] Below are global capability statements for each route (seed text at registration; after distillation, summaries induced from historical successes, to capture shared competence beyond the retrieved examples): {profile_block} [BM25 retrieval examples](successful qu...
-
[7]
Output exactly one agent dispatch per query: choose one of Cloud Services, AI Accelerator, Server Hardware, Mobile OS defined in the deployment template
-
[8]
Output route names only, with no explanation or extra text. Figure 2: Prompt for FlyRoute. Note that the live system prompt uses Chinese-language tokens; English labels are shown here for readability. FlyRoute capability distillation: user prompt (English translation). You are an expert at analyzing AI agent capabilities. From the information below, produce a concis...
-
[9]
Ground the summary only in queries that actually succeeded; do not fabricate capabilities
-
[10]
If empirical behavior differs from the initial description or the prior distilled summary, follow the newest evidence; retain prior wording that still holds and revise only when evidence conflicts. 3. Keep the description concise: ≤500 characters in the deployed Chinese template (we state these rules in English for readability).
-
[11]
If no prior distilled description exists, {prev_learned_block} is omitted. [Output format] <Description>... capability text (within the character cap) ...</Description> Figure 3: Distillation prompt for FlyRoute. Note that the live template is in Chinese; we translate it to English here for readability. FlyRoute LLM-as-Judge: prompts (English transla...
-
[12]
Cloud Services: Cloud product resources, tools, services, specifications, purchase, documentation, etc
-
[13]
AI Accelerator: Ascend stack, CANN, MindSpore, operators, model training and inference tooling, etc
-
[14]
Server Hardware: Kunpeng, openEuler, DevKit, BoostKit, HPC, acceleration libraries, storage, etc
-
[15]
Mobile OS: HarmonyOS, Ability Kit, ArkTS, ArkUI, HarmonyOS Next, etc. [Scoring principles]
-
[16]
r must address the technical subject of q; off-topic chatter, unrelated marketing, or empty acknowledgments lowers the score
-
[17]
If r sprawls largely outside those four competencies with weak grounding, penalize under “off-topic/low applicability.”
-
[18]
Responses that are blank, boilerplate refusals, or contain no substantive guidance should score ≤0.35
-
[19]
Partially helpful but misses the crux earns mid-range scores; only concrete, actionable, mostly non-contradictory answers earn high scores (watch for hallucination)
-
[20]
The FlyRoute gate typically applies θ ≈ 0.70: slightly beyond “barely usable” denotes evidence worth caching. [Optional calibration block] Few-shot excerpts can be inlined here via {examples_optional} when enabled in code. [Output schema] Return a single JSON object. Use ASCII quotes only; omit Markdown wrappers. The mandatory key is quality_score, whose value...
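Based on the judge-prompt excerpt above, the gate decision can be sketched as parsing the judge's JSON and comparing `quality_score` with θ. The excerpt is truncated before it states the value's type or range, so treating the score as a float in [0, 1] is an assumption here, as are the function name and the choice to reject any malformed output.

```python
import json

def passes_gate(judge_output: str, theta: float = 0.70) -> bool:
    """Return True if the judge's JSON carries a quality_score of at least theta.

    Malformed JSON, a missing key, or an out-of-range score is treated as a
    failed gate (assumption: scores live in [0, 1]).
    """
    try:
        payload = json.loads(judge_output)
        score = float(payload["quality_score"])
    except (ValueError, KeyError, TypeError):
        return False
    return 0.0 <= score <= 1.0 and score >= theta
```

Rejecting unparseable output keeps the success store conservative: a pair is cached only when the judge produced a valid, sufficiently high score.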