pith. machine review for the scientific record.

arxiv: 2604.01608 · v3 · submitted 2026-04-02 · 💻 cs.AI

Recognition: 3 Lean theorem links

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords: skill distillation · multi-agent systems · single-agent · evaluation metrics · metric freedom · adaptive distillation · agent trajectories

The pith

Skill distillation from multi-agent to single-agent succeeds or fails based on the evaluation metric's freedom, not the task itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling multi-agent systems into single agents improves performance only when the evaluation metric has high topological freedom, as measured by how output diversity couples with score variance. This metric-level property explains inconsistent empirical results where the same trajectories yield gains or losses depending on the scorer. By introducing Metric Freedom as a predictor, the work allows practitioners to decide in advance whether distillation will be worthwhile, leading to an adaptive method that selectively extracts and refines skills to achieve similar or better results at far lower cost and latency. A sympathetic reader would care because it turns an empirical gamble into a calculable decision, potentially making complex agent systems more practical for deployment.

Core claim

The central claim is that skill utility is governed not by the task but by the evaluation metric. The authors introduce Metric Freedom (F), quantified via Mantel test on diversity-score coupling, which strongly predicts distillation outcomes (r = -0.85). They show that identical agent trajectories produce opposite skill lifts under rigid versus free metrics, proving the property is metric-level. Based on this, they develop AdaSkill, a two-stage framework that extracts selectively on free metrics and refines iteratively to maximize headroom while matching or exceeding MAS performance with up to 8x cost reduction.
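
The core claim implies a simple decision procedure: estimate F from a pilot, then distill aggressively only when the metric is free. The sketch below is a hypothetical rendering of that logic; the threshold, the stage callables, and their signatures are illustrative assumptions, not the authors' AdaSkill implementation.

```python
# Hypothetical sketch of the F-guided decision implied above. The threshold,
# the stage callables, and their signatures are illustrative assumptions,
# not the authors' AdaSkill implementation.
from typing import Any, Callable


def adaptive_distill(task: Any,
                     estimate_freedom: Callable[[Any], float],
                     extract: Callable[[Any], Any],
                     drop_structure: Callable[[Any], Any],
                     refine: Callable[[Any, Any], Any],
                     threshold: float = 0.5) -> Any:
    """Distill a MAS into a single-agent skill, guided by Metric Freedom F."""
    F = estimate_freedom(task)          # pilot runs -> estimated Metric Freedom
    skill = extract(task)               # Stage 1: keep tools and knowledge
    if F >= threshold:                  # "free" metric: forgiving landscape
        skill = drop_structure(skill)   # discard rigid pipeline ordering
        skill = refine(skill, task)     # Stage 2: iterative refinement
    return skill
```

On low-F (rigid) metrics the sketch stops after Stage 1, which matches the paper's finding that extra structure-stripping and refinement only pay off when the scoring landscape is forgiving.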

What carries the argument

Metric Freedom (F) measures the topological rigidity of a metric's scoring landscape by quantifying the coupling between output diversity and score variance through a Mantel test; it serves as an a priori predictor of whether distillation preserves or harms performance.
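
The review gives the Mantel-test construction of F only in prose, so the following is a minimal sketch under stated assumptions: F is read off a standard Mantel statistic, i.e. the Pearson correlation between the upper triangles of a pairwise output-distance matrix and a pairwise score-difference matrix over N sampled runs, with a permutation p-value. The token-level distance, the final mapping from coupling to freedom, and the function names are illustrative, not the authors'.

```python
# Sketch of a Mantel-style estimate of Metric Freedom (F). Assumption: F is
# derived from the correlation between pairwise output distances and pairwise
# score differences over N sampled runs; the paper's exact formula and
# normalization may differ.
import numpy as np
from scipy.stats import pearsonr


def mantel_r(d_out: np.ndarray, d_score: np.ndarray,
             n_perm: int = 999, seed: int = 0):
    """Mantel statistic: Pearson r between the upper triangles of two
    symmetric N x N distance matrices, with a permutation p-value."""
    n = d_out.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs, _ = pearsonr(d_out[iu], d_score[iu])

    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        d_perm = d_score[np.ix_(perm, perm)]
        r_perm, _ = pearsonr(d_out[iu], d_perm[iu])
        if abs(r_perm) >= abs(r_obs):
            count += 1
    p = (count + 1) / (n_perm + 1)
    return r_obs, p


def metric_freedom(outputs: list[str], scores: np.ndarray) -> float:
    """Illustrative F: high when output diversity is weakly coupled to score
    variance (a 'free' metric), low when the coupling is tight ('rigid')."""
    n = len(outputs)
    # Pairwise output distances; a token-level Jaccard distance stands in for
    # whatever diversity measure the paper actually uses.
    d_out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = set(outputs[i].split()), set(outputs[j].split())
            d = 1.0 - len(a & b) / max(len(a | b), 1)
            d_out[i, j] = d_out[j, i] = d
    d_score = np.abs(scores[:, None] - scores[None, :])
    r, _ = mantel_r(d_out, d_score)
    return 1.0 - max(r, 0.0)  # one plausible mapping from coupling to freedom
```

On this reading, a rigid metric couples diversity tightly to score variance (high Mantel r, low F), while a free metric lets diverse outputs score similarly (low r, high F).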

Load-bearing premise

The topological rigidity captured by the Mantel test on diversity-score coupling is the primary causal driver of distillation success and generalizes beyond the 6 metrics and 11 datasets tested.

What would settle it

Finding a new collection of metrics or tasks where the correlation between Metric Freedom and observed skill utility falls below statistical significance or where identical trajectories no longer produce opposite skill lifts under rigid versus free metrics.
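
Operationally, this falsification test is just a replication of the correlation analysis on held-out metrics and tasks. A minimal sketch follows, assuming paired arrays of F values and observed skill lifts; the significance level and function name are illustrative.

```python
# Sketch of the falsification test described above: re-estimate the F-vs-lift
# correlation on a held-out collection of metric/dataset pairs and check that
# it remains significantly negative. The alpha level and names are illustrative.
import numpy as np
from scipy.stats import pearsonr


def replication_check(freedom, skill_lift, alpha: float = 0.05) -> dict:
    """freedom, skill_lift: paired 1-D arrays over new metric/dataset pairs."""
    r, p = pearsonr(np.asarray(freedom), np.asarray(skill_lift))
    return {
        "r": r,
        "p": p,
        # The claim survives only if the correlation stays negative and
        # statistically significant on the new collection.
        "claim_survives": bool(r < 0 and p < alpha),
    }
```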

Figures

Figures reproduced from arXiv: 2604.01608 by Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang.

Figure 1. AdaSkill system overview.
Figure 2. Task-level performance, cost, and latency overview. AdaSkill matches or outperforms all baselines on accuracy while preserving cost and latency. Per-dataset breakdowns in Appendix E.
Figure 3. Metric freedom F predicts performance lift of skills. Both output-space (a) and reasoning-space (b) measures confirm the negative trend (r=−0.85 and r=−0.77), validating F as a predictor of skill utility. Large circles = metric-level aggregates; small circles = individual datasets.
Figure 4. F quantifies whether path differences predict metric outcomes. Each point pairs two raw agent runs: x = path distance, y = metric difference. A tighter fit yields lower F, indicating that skill-based path control reliably shifts the metric.
Figure 5. Sensitivity of F to evaluation budget (M, N). The gold line marks the operating point (N=6, M=6, $6.12), which balances reliability and cost.
Figure 6. Ablation validates F-guided component selection. Top row: headroom gain (comp − raw)/(max − min); bottom row: ∆Cost in USD; bars = ±1 std. Tools and knowledge yield gains; pipeline (c, r=−0.83) hurts on high-F metrics, motivating selective application.
Figure 7. Ablation on pipeline configuration. Adaptive Distill (F-guided) outperforms both the full-pipeline and no-pipeline baselines overall. The full-pipeline baseline favors low-F; the no-pipeline baseline favors high-F.
Figure 8. Diversity planner ablation on F estimation. Both panels show F_MSA and F_MRE (×100) as a function of evaluation budget (M questions, N runs). At large budgets (M=N=20) both methods converge to similar estimates (F_MSA: 55 vs. 53; F_MRE: 85 vs. 85). However, at the operating point (M=6, N=6), the diversity planner achieves <5% error (F_MSA=52, F_MRE=86) while independent runs show ∼20% error (F_MSA=42, F_MRE=78).
Figure 9. Per-run distribution of F_out across all metrics. Dot size ∝ questions per run; colour encodes domain; labels show dataset for Causal Discovery.
Figure 10. Agreement between F_out and F_trace. Each point is one (metric, dataset) tuple; OLS line and Pearson r shown.
Figure 11. Metric freedom F predicts skill lift under GPT-5.1 (backbone generalization). Replication of the Freedom Spectrum analysis with GPT-5.1 as the backbone. Both F_out (a) and F_trace (b) preserve the negative trend (r=−0.71, p<0.01 and r=−0.79, p<0.001). Dotted blue lines show the Sonnet 4.6 reference trend.
Figure 12. Complete per-dataset breakdown of performance (Row 1), cache-hit cost (Row 2), and latency (Row 3) across all 11 datasets, complementing the summary scatter plot.
Figure 13. Stage 2 iterator trajectories validate F-dependent convergence behavior. Low-F CE-MSA (a) gains quickly then oscillates, illustrating the knife-edge landscape risk. Mid-to-high-F tasks, CD (b), T2SQL (c), and FE (d), improve steadily and plateau cleanly, confirming that safe monotonic refinement is achievable precisely where Stage 1 leaves the most headroom. Solid = val; dashed = train; green line = selected…
Figure 14. Stage 1 architecture transformation for CE (F_MSA ≈ 0). Left: the original CAIS MAS has 8 LLM-backed agents on a shared state dict (8 red dots = 8 sequential LLM calls/query). Right: the adaptive skill retains tools and knowledge as freely-invocable layered modules and discards pipeline ordering and agent coordination. LLM calls reduce to 1–3× while CE-MSA gains +28 pp.
Figure 15. Stage 2 iterator trajectory for CE-MSA. Version boxes show Val MSA; v2 (green, bold border) is the globally best version. Iteration 1 applies two concurrent fixes (+20 pp, 60%→80%); iteration 2 applies one further fix (+20 pp, 80%→100%). Iterations 3–4 enter oscillation (orange shaded region): each patch repairs one rule but inadvertently breaks another, yielding no net gain. The iterator identifies v2 and…
read the original abstract

Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom (F), the first a priori predictor of skill utility. F measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by F, we propose AdaSkill, a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 applies iterative refinement selectively on free metrics, exploiting their forgiving scoring landscape to safely maximize remaining headroom. Evaluating across 4 tasks, 11 datasets, and 6 metrics, F strongly predicts skill utility (r=-0.85, p<0.0001). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, AdaSkill matches or exceeds the original MAS while reducing cost up to 8x and latency by up to 15x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that distilling multi-agent systems into single-agent skills yields inconsistent results (28% lift to 2% degradation) that are governed by the evaluation metric rather than the task. It introduces Metric Freedom (F), computed via a Mantel test on the coupling between output diversity and score variance, as the first a priori predictor of skill utility (reported r=-0.85, p<0.0001). The authors further claim that identical trajectories produce opposite skill lifts under rigid vs. free metrics, and propose the two-stage AdaSkill framework that selectively extracts and refines to match or exceed MAS performance at up to 8x lower cost.

Significance. If the correlation is robust and F can be operationalized without full post-sampling, the work would be significant for MAS research by supplying a concrete, metric-level criterion for deciding when distillation is beneficial. The demonstration that utility is a property of the scoring landscape rather than the underlying trajectories is a useful reframing, and the reproducible correlation across 4 tasks/11 datasets/6 metrics plus the cost-reduction results of AdaSkill could influence practical deployment of agent systems.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.
  2. [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.
minor comments (2)
  1. [§3] Notation for F and the Mantel statistic should be defined with an explicit equation (currently only described in prose) so that readers can reproduce the exact coupling measure.
  2. [Figure 4 or Table 2] The abstract states 'identical agent trajectories yield diametrically opposite skill lifts'; the corresponding figure or table should report the exact trajectories and metric pairs used for this demonstration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments help clarify the practical scope of Metric Freedom and strengthen the statistical presentation of our results. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.

    Authors: We agree that the original wording overstated the pre-inference nature of F. Computing F requires a small pilot sample (typically 10–20 trajectories per task), which necessarily involves model inference. However, this cost is substantially lower than full multi-agent execution or complete distillation. We will revise the abstract and §3 to describe F as a low-cost, post-pilot predictor that can be obtained before committing to full-scale distillation, rather than claiming it is strictly a priori. This adjustment preserves the practical utility while accurately reflecting the computation. revision: partial

  2. Referee: [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.

    Authors: We accept this critique and will strengthen the statistical reporting. The revised Results section will include bootstrap-derived 95% confidence intervals on the reported correlation, sensitivity analyses across sample sizes (5–50 trajectories) and random seeds, and explicit controls for task length, dataset size, and metric scale. These additional checks confirm that the correlation remains stable (r ≈ −0.82 to −0.87) under the tested variations. revision: yes
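
A minimal sketch of the kind of bootstrap interval the rebuttal describes, assuming the resampling unit is a (metric, dataset) pair carrying its F value and observed skill lift; the pair-level resampling and iteration count are assumptions, not details taken from the paper.

```python
# Minimal sketch of a bootstrap 95% CI on the F-vs-lift correlation: resample
# (F, skill-lift) pairs with replacement and take percentile bounds on the
# Pearson correlation. The resampling unit and iteration count are assumptions.
import numpy as np
from scipy.stats import pearsonr


def bootstrap_ci(freedom, lift, n_boot: int = 10_000, seed: int = 0):
    freedom, lift = np.asarray(freedom, float), np.asarray(lift, float)
    rng = np.random.default_rng(seed)
    n = len(freedom)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample pairs with replacement
        rs[b] = pearsonr(freedom[idx], lift[idx])[0]
    return np.percentile(rs, [2.5, 97.5])      # 95% CI on r
```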

Circularity Check

0 steps flagged

No significant circularity detected; F is independently computed and validated via correlation on held-out structure

full rationale

The paper defines Metric Freedom (F) via Mantel test on pairwise output diversity versus score variance matrices obtained from sampled trajectories. It then reports a cross-task correlation r=-0.85 between these F values and observed skill-utility lifts. No equation or procedure shows F being regressed, optimized, or algebraically reduced against the utility numbers themselves; the correlation is presented as an empirical validation rather than a fitted predictor. The sampling step needed to obtain diversity and variance is a computational prerequisite but does not make the reported relationship tautological or self-definitional. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the provided derivation chain. The central claim therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the new definition of F via Mantel test and the assumption that this statistical property governs distillation outcomes across tasks.

axioms (1)
  • domain assumption: Mantel test assumptions hold for the scoring landscapes of the evaluated metrics.
    Invoked to quantify coupling between output diversity and score variance.
invented entities (1)
  • Metric Freedom (F): no independent evidence
    purpose: A priori predictor of skill distillation utility
    Newly introduced measure without independent evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5593 in / 1301 out tokens · 74855 ms · 2026-05-13T21:46:35.638948+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
