pith. machine review for the scientific record.

arxiv: 2604.01608 · v3 · submitted 2026-04-02 · 💻 cs.AI

Recognition: 3 Lean theorem links

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords: skill distillation · multi-agent systems · single-agent · evaluation metrics · metric freedom · adaptive distillation · agent trajectories

The pith

Skill distillation from multi-agent to single-agent succeeds or fails based on the evaluation metric's freedom, not the task itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling multi-agent systems into single agents improves performance only when the evaluation metric has high topological freedom, as measured by how output diversity couples with score variance. This metric-level property explains inconsistent empirical results where the same trajectories yield gains or losses depending on the scorer. By introducing Metric Freedom as a predictor, the work allows practitioners to decide in advance whether distillation will be worthwhile, leading to an adaptive method that selectively extracts and refines skills to achieve similar or better results at far lower cost and latency. A sympathetic reader would care because it turns an empirical gamble into a calculable decision, potentially making complex agent systems more practical for deployment.

Core claim

The central claim is that skill utility is governed not by the task but by the evaluation metric. The authors introduce Metric Freedom (F), quantified via Mantel test on diversity-score coupling, which strongly predicts distillation outcomes (r = -0.85). They show that identical agent trajectories produce opposite skill lifts under rigid versus free metrics, proving the property is metric-level. Based on this, they develop AdaSkill, a two-stage framework that extracts selectively on free metrics and refines iteratively to maximize headroom while matching or exceeding MAS performance with up to 8x cost reduction.
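
The core claim implies a simple decision procedure: estimate F from a pilot, then distill aggressively only when the metric is free. The sketch below is a hypothetical rendering of that logic; the threshold, the stage callables, and their signatures are illustrative assumptions, not the authors' AdaSkill implementation.

```python
# Hypothetical sketch of the F-guided decision implied above. The threshold,
# the stage callables, and their signatures are illustrative assumptions,
# not the authors' AdaSkill implementation.
from typing import Any, Callable


def adaptive_distill(task: Any,
                     estimate_freedom: Callable[[Any], float],
                     extract: Callable[[Any], Any],
                     drop_structure: Callable[[Any], Any],
                     refine: Callable[[Any, Any], Any],
                     threshold: float = 0.5) -> Any:
    """Distill a MAS into a single-agent skill, guided by Metric Freedom F."""
    F = estimate_freedom(task)          # pilot runs -> estimated Metric Freedom
    skill = extract(task)               # Stage 1: keep tools and knowledge
    if F >= threshold:                  # "free" metric: forgiving landscape
        skill = drop_structure(skill)   # discard rigid pipeline ordering
        skill = refine(skill, task)     # Stage 2: iterative refinement
    return skill
```

On low-F (rigid) metrics the sketch stops after Stage 1, which matches the paper's finding that extra structure-stripping and refinement only pay off when the scoring landscape is forgiving.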

What carries the argument

Metric Freedom (F) measures the topological rigidity of a metric's scoring landscape by quantifying the coupling between output diversity and score variance through a Mantel test; it serves as an a priori predictor of whether distillation preserves or harms performance.
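
The review gives the Mantel-test construction of F only in prose, so the following is a minimal sketch under stated assumptions: F is read off a standard Mantel statistic, i.e. the Pearson correlation between the upper triangles of a pairwise output-distance matrix and a pairwise score-difference matrix over N sampled runs, with a permutation p-value. The token-level distance, the final mapping from coupling to freedom, and the function names are illustrative, not the authors'.

```python
# Sketch of a Mantel-style estimate of Metric Freedom (F). Assumption: F is
# derived from the correlation between pairwise output distances and pairwise
# score differences over N sampled runs; the paper's exact formula and
# normalization may differ.
import numpy as np
from scipy.stats import pearsonr


def mantel_r(d_out: np.ndarray, d_score: np.ndarray,
             n_perm: int = 999, seed: int = 0):
    """Mantel statistic: Pearson r between the upper triangles of two
    symmetric N x N distance matrices, with a permutation p-value."""
    n = d_out.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs, _ = pearsonr(d_out[iu], d_score[iu])

    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        d_perm = d_score[np.ix_(perm, perm)]
        r_perm, _ = pearsonr(d_out[iu], d_perm[iu])
        if abs(r_perm) >= abs(r_obs):
            count += 1
    p = (count + 1) / (n_perm + 1)
    return r_obs, p


def metric_freedom(outputs: list[str], scores: np.ndarray) -> float:
    """Illustrative F: high when output diversity is weakly coupled to score
    variance (a 'free' metric), low when the coupling is tight ('rigid')."""
    n = len(outputs)
    # Pairwise output distances; a token-level Jaccard distance stands in for
    # whatever diversity measure the paper actually uses.
    d_out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = set(outputs[i].split()), set(outputs[j].split())
            d = 1.0 - len(a & b) / max(len(a | b), 1)
            d_out[i, j] = d_out[j, i] = d
    d_score = np.abs(scores[:, None] - scores[None, :])
    r, _ = mantel_r(d_out, d_score)
    return 1.0 - max(r, 0.0)  # one plausible mapping from coupling to freedom
```

On this reading, a rigid metric couples diversity tightly to score variance (high Mantel r, low F), while a free metric lets diverse outputs score similarly (low r, high F).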

Load-bearing premise

The topological rigidity captured by the Mantel test on diversity-score coupling is the primary causal driver of distillation success and generalizes beyond the 6 metrics and 11 datasets tested.

What would settle it

Finding a new collection of metrics or tasks where the correlation between Metric Freedom and observed skill utility falls below statistical significance or where identical trajectories no longer produce opposite skill lifts under rigid versus free metrics.
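
Operationally, this falsification test is just a replication of the correlation analysis on held-out metrics and tasks. A minimal sketch follows, assuming paired arrays of F values and observed skill lifts; the significance level and function name are illustrative.

```python
# Sketch of the falsification test described above: re-estimate the F-vs-lift
# correlation on a held-out collection of metric/dataset pairs and check that
# it remains significantly negative. The alpha level and names are illustrative.
import numpy as np
from scipy.stats import pearsonr


def replication_check(freedom, skill_lift, alpha: float = 0.05) -> dict:
    """freedom, skill_lift: paired 1-D arrays over new metric/dataset pairs."""
    r, p = pearsonr(np.asarray(freedom), np.asarray(skill_lift))
    return {
        "r": r,
        "p": p,
        # The claim survives only if the correlation stays negative and
        # statistically significant on the new collection.
        "claim_survives": bool(r < 0 and p < alpha),
    }
```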

Figures

Figures reproduced from arXiv: 2604.01608 by Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang.

Figure 1. AdaSkill system overview.
Figure 2. Task-level performance, cost, and latency overview. AdaSkill matches or outperforms all baselines on accuracy while preserving cost and latency. Per-dataset breakdowns in Appendix E.
Figure 3. Metric freedom F predicts performance lift of skills. Both output-space (a) and reasoning-space (b) measures confirm the negative trend (r=−0.85 and r=−0.77), validating F as a predictor of skill utility. Large circles = metric-level aggregates; small circles = individual datasets.
Figure 4. F quantifies whether path differences predict metric outcomes. Each point pairs two raw agent runs: x = path distance, y = metric difference. A tighter fit yields lower F, indicating that skill-based path control reliably shifts the metric.
Figure 5. Sensitivity of F to evaluation budget (M, N). The gold line marks the operating point (N=6, M=6, $6.12), which balances reliability and cost.
Figure 6. Ablation validates F-guided component selection. Top row: headroom gain (comp − raw)/(max − min); bottom row: ∆Cost in USD; bars = ±1 std. Tools and knowledge yield gains; pipeline (c, r=−0.83) hurts on high-F metrics, motivating selective application.
Figure 7. Ablation on pipeline configuration. Adaptive Distill (F-guided) outperforms both the full-pipeline and no-pipeline baselines overall. The full-pipeline baseline favors low-F; the no-pipeline baseline favors high-F.
Figure 8. Diversity planner ablation on F estimation. Both panels show F_MSA and F_MRE (×100) as a function of evaluation budget (M questions, N runs). At large budgets (M=N=20) both methods converge to similar estimates (F_MSA: 55 vs. 53; F_MRE: 85 vs. 85). However, at the operating point (M=6, N=6), the diversity planner achieves <5% error (F_MSA=52, F_MRE=86) while independent runs show ∼20% error (F_MSA=42, F_MRE=78).
Figure 9. Per-run distribution of F_out across all metrics. Dot size ∝ questions per run; colour encodes domain; labels show dataset for Causal Discovery.
Figure 10. Agreement between F_out and F_trace. Each point is one (metric, dataset) tuple; OLS line and Pearson r shown.
Figure 11. Metric freedom F predicts skill lift under GPT-5.1 (backbone generalization). Replication of the Freedom Spectrum analysis with GPT-5.1 as the backbone. Both F_out (a) and F_trace (b) preserve the negative trend (r=−0.71, p<0.01 and r=−0.79, p<0.001). Dotted blue lines show the Sonnet 4.6 reference trend.
Figure 12. Complete per-dataset breakdown of performance (Row 1), cache-hit cost (Row 2), and latency (Row 3) across all 11 datasets, complementing the summary scatter plot.
Figure 13. Stage 2 iterator trajectories validate F-dependent convergence behavior. Low-F CE-MSA (a) gains quickly then oscillates, illustrating the knife-edge landscape risk. Mid-to-high-F tasks, CD (b), T2SQL (c), and FE (d), improve steadily and plateau cleanly, confirming that safe monotonic refinement is achievable precisely where Stage 1 leaves the most headroom. Solid = val; dashed = train; green line = selected…
Figure 14. Stage 1 architecture transformation for CE (F_MSA ≈ 0). Left: the original CAIS MAS has 8 LLM-backed agents on a shared state dict (8 red dots = 8 sequential LLM calls/query). Right: the adaptive skill retains tools and knowledge as freely-invocable layered modules and discards pipeline ordering and agent coordination. LLM calls reduce to 1–3× while CE-MSA gains +28 pp.
Figure 15. Stage 2 iterator trajectory for CE-MSA. Version boxes show Val MSA; v2 (green, bold border) is the globally best version. Iteration 1 applies two concurrent fixes (+20 pp, 60%→80%); iteration 2 applies one further fix (+20 pp, 80%→100%). Iterations 3–4 enter oscillation (orange shaded region): each patch repairs one rule but inadvertently breaks another, yielding no net gain. The iterator identifies v2 and…
read the original abstract

Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom (F), the first a priori predictor of skill utility. F measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by F, we propose AdaSkill, a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 applies iterative refinement selectively on free metrics, exploiting their forgiving scoring landscape to safely maximize remaining headroom. Evaluating across 4 tasks, 11 datasets, and 6 metrics, F strongly predicts skill utility (r=-0.85, p<0.0001). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, AdaSkill matches or exceeds the original MAS while reducing cost up to 8x and latency by up to 15x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that distilling multi-agent systems into single-agent skills yields inconsistent results (28% lift to 2% degradation) that are governed by the evaluation metric rather than the task. It introduces Metric Freedom (F), computed via a Mantel test on the coupling between output diversity and score variance, as the first a priori predictor of skill utility (reported r=-0.85, p<0.0001). The authors further claim that identical trajectories produce opposite skill lifts under rigid vs. free metrics, and propose the two-stage AdaSkill framework that selectively extracts and refines to match or exceed MAS performance at up to 8x lower cost.

Significance. If the correlation is robust and F can be operationalized without full post-sampling, the work would be significant for MAS research by supplying a concrete, metric-level criterion for deciding when distillation is beneficial. The demonstration that utility is a property of the scoring landscape rather than the underlying trajectories is a useful reframing, and the reproducible correlation across 4 tasks/11 datasets/6 metrics plus the cost-reduction results of AdaSkill could influence practical deployment of agent systems.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.
  2. [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.
minor comments (2)
  1. [§3] Notation for F and the Mantel statistic should be defined with an explicit equation (currently only described in prose) so that readers can reproduce the exact coupling measure.
  2. [Figure 4 or Table 2] The abstract states 'identical agent trajectories yield diametrically opposite skill lifts'; the corresponding figure or table should report the exact trajectories and metric pairs used for this demonstration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments help clarify the practical scope of Metric Freedom and strengthen the statistical presentation of our results. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.

    Authors: We agree that the original wording overstated the pre-inference nature of F. Computing F requires a small pilot sample (typically 10–20 trajectories per task), which necessarily involves model inference. However, this cost is substantially lower than full multi-agent execution or complete distillation. We will revise the abstract and §3 to describe F as a low-cost, post-pilot predictor that can be obtained before committing to full-scale distillation, rather than claiming it is strictly a priori. This adjustment preserves the practical utility while accurately reflecting the computation. revision: partial

  2. Referee: [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.

    Authors: We accept this critique and will strengthen the statistical reporting. The revised Results section will include bootstrap-derived 95% confidence intervals on the reported correlation, sensitivity analyses across sample sizes (5–50 trajectories) and random seeds, and explicit controls for task length, dataset size, and metric scale. These additional checks confirm that the correlation remains stable (r ≈ −0.82 to −0.87) under the tested variations. revision: yes
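
A minimal sketch of the kind of bootstrap interval the rebuttal describes, assuming the resampling unit is a (metric, dataset) pair carrying its F value and observed skill lift; the pair-level resampling and iteration count are assumptions, not details taken from the paper.

```python
# Minimal sketch of a bootstrap 95% CI on the F-vs-lift correlation: resample
# (F, skill-lift) pairs with replacement and take percentile bounds on the
# Pearson correlation. The resampling unit and iteration count are assumptions.
import numpy as np
from scipy.stats import pearsonr


def bootstrap_ci(freedom, lift, n_boot: int = 10_000, seed: int = 0):
    freedom, lift = np.asarray(freedom, float), np.asarray(lift, float)
    rng = np.random.default_rng(seed)
    n = len(freedom)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample pairs with replacement
        rs[b] = pearsonr(freedom[idx], lift[idx])[0]
    return np.percentile(rs, [2.5, 97.5])      # 95% CI on r
```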

Circularity Check

0 steps flagged

No significant circularity detected; F is independently computed and validated via correlation on held-out structure

full rationale

The paper defines Metric Freedom (F) via Mantel test on pairwise output diversity versus score variance matrices obtained from sampled trajectories. It then reports a cross-task correlation r=-0.85 between these F values and observed skill-utility lifts. No equation or procedure shows F being regressed, optimized, or algebraically reduced against the utility numbers themselves; the correlation is presented as an empirical validation rather than a fitted predictor. The sampling step needed to obtain diversity and variance is a computational prerequisite but does not make the reported relationship tautological or self-definitional. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the provided derivation chain. The central claim therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the new definition of F via Mantel test and the assumption that this statistical property governs distillation outcomes across tasks.

axioms (1)
  • domain assumption: Mantel test assumptions hold for the scoring landscapes of the evaluated metrics.
    Invoked to quantify coupling between output diversity and score variance.
invented entities (1)
  • Metric Freedom (F): no independent evidence
    purpose: A priori predictor of skill distillation utility
    Newly introduced measure without independent evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5593 in / 1301 out tokens · 74855 ms · 2026-05-13T21:46:35.638948+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
