Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Pith reviewed 2026-05-15 01:12 UTC · model grok-4.3
The pith
Trace2Skill turns pools of agent execution traces into a single, transferable skill guide through parallel analysis and hierarchical consolidation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trace2Skill dispatches parallel sub-agents to analyze a broad pool of execution trajectories, extracts trajectory-specific lessons, and hierarchically consolidates them into a unified, conflict-free skill directory through inductive reasoning. The directory can deepen human-written skills or create new ones, and the resulting declarative skills improve agent behavior across model scales and out-of-distribution settings without parameter updates or external retrieval.
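As a reading aid, the two-stage pipeline the claim describes (a parallel fleet of trajectory analyzers, then hierarchical merging into one directory) can be sketched in miniature. This is a structural sketch only: `analyze_trajectory` and its placeholder lesson strings are hypothetical stand-ins for the paper's LLM sub-agent calls, and set union stands in for inductive conflict resolution.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trajectory(trajectory):
    # Hypothetical sub-agent call: extract trajectory-specific lessons.
    # A real system would prompt an LLM; we emit trivial placeholders.
    return [f"lesson from step {i}" for i, _ in enumerate(trajectory)]

def consolidate(lesson_groups):
    # Hierarchical pairwise merge: combine lesson sets level by level.
    # Deduplication via set union stands in for inductive conflict
    # resolution; the result is a single unified directory.
    level = [set(g) for g in lesson_groups]
    while len(level) > 1:
        level = [level[i] | (level[i + 1] if i + 1 < len(level) else set())
                 for i in range(0, len(level), 2)]
    return sorted(level[0])

def trace2skill(trajectory_pool):
    # Stage 1: parallel fleet of analyzers over the trajectory pool.
    with ThreadPoolExecutor() as pool:
        lessons = list(pool.map(analyze_trajectory, trajectory_pool))
    # Stage 2: hierarchical consolidation into one skill directory.
    return consolidate(lessons)

skills = trace2skill([["read", "filter"], ["read", "sum"]])
```

The point of the shape is that no trajectory is processed sequentially against a growing skill set, which is the overfitting mode the paper attributes to prior work.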
What carries the argument
A parallel fleet of sub-agents performs trajectory analysis, followed by hierarchical inductive consolidation into a single skill directory.
Load-bearing premise
Parallel sub-agent review of multiple trajectories followed by inductive merging will reliably remove conflicts and yield skills that generalize instead of overfitting to the sampled runs or the reviewer models.
What would settle it
The claim would be falsified if skills produced by the process fail to raise accuracy on a fresh set of tasks that differ in structure from the original trajectory pool, or when transferred to an LLM whose scale or training data differs substantially from the evolution model's.
Original abstract
Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills -- requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trace2Skill, a framework that dispatches parallel sub-agents to analyze diverse LLM agent execution trajectories, extracts trajectory-specific lessons, and applies hierarchical inductive consolidation to produce a unified, conflict-free skill directory. The method supports both refining human-written skills and generating new ones from scratch. Experiments in spreadsheet, VisionQA, and math reasoning domains report significant gains over strong baselines including Anthropic's official xlsx skills, with evolved skills transferring across LLM scales (e.g., Qwen3.5-35B skills improving a 122B model by up to 57.65 absolute percentage points on WikiTableQuestions) and generalizing to OOD settings without parameter updates or retrieval modules.
Significance. If the transfer and generalization results hold under rigorous validation, the work would be significant for demonstrating a scalable, trajectory-grounded approach to packaging agent experience into declarative, model-agnostic skills using open-source models as small as 35B parameters. This addresses the manual authoring bottleneck and sequential overfitting issues in automated skill generation, with potential to improve complex agent performance across domains while remaining parameter-free.
major comments (2)
- [Abstract / Method] Abstract and method description: The hierarchical inductive consolidation step is presented as producing 'conflict-free' skills, but no explicit mechanism, algorithm, or pseudocode is given for conflict detection, resolution, or generality assessment; this is load-bearing for the claim that the process abstracts beyond trajectory-local patterns rather than overfitting to input executions or model quirks.
- [Experiments] Experimental results: Headline claims of up to 57.65 pp absolute gains and cross-scale/OOD transfer are stated without error bars, statistical tests, ablation isolating the hierarchical consolidation from simpler aggregation, data exclusion rules, or details on held-out trajectory pools; this leaves the support for generalizability unassessable from the reported assertions.
minor comments (2)
- [Method] Clarify how the parallel sub-agent fleet size and diversity criteria are chosen, and whether any hyper-parameters are tuned on the same trajectories used for skill extraction.
- [Experiments] Add a table or figure summarizing baseline comparisons with exact metrics, model sizes, and task variants for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate clarifications and additional experimental details.
Point-by-point responses
Referee: [Abstract / Method] Abstract and method description: The hierarchical inductive consolidation step is presented as producing 'conflict-free' skills, but no explicit mechanism, algorithm, or pseudocode is given for conflict detection, resolution, or generality assessment; this is load-bearing for the claim that the process abstracts beyond trajectory-local patterns rather than overfitting to input executions or model quirks.
Authors: We agree that an explicit description of the consolidation process would strengthen the manuscript. The current text describes the step at a high level as parallel lesson extraction followed by inductive reasoning to yield a unified directory, but we will add a dedicated Methods subsection with pseudocode. This will detail conflict detection (via semantic contradiction checks across extracted lessons), resolution (by retaining the most general formulation supported by multiple trajectories), and generality assessment (via cross-validation on held-out executions). These additions will clarify how the process moves beyond trajectory-local patterns. Revision: yes.
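The resolution rule described here (retain the formulation supported by the most trajectories) can be illustrated with a toy sketch. The `(condition, advice)` lesson encoding and the use of support counts as a proxy for "most general formulation" are our assumptions for illustration, not details from the paper:

```python
from collections import Counter

def resolve_conflicts(lessons):
    """Toy stand-in for the described consolidation. Lessons are
    (condition, advice) pairs; two lessons conflict when they attach
    different advice to the same condition. Resolution keeps the
    advice supported by the most trajectories, approximating the
    'most general formulation' criterion by support count."""
    support = Counter(lessons)
    resolved = {}  # condition -> (advice, support count)
    for (condition, advice), count in support.items():
        best = resolved.get(condition)
        if best is None or count > best[1]:
            resolved[condition] = (advice, count)
    return {cond: advice for cond, (advice, _) in resolved.items()}

# Two trajectories say "treat as 0", one says "skip row": conflict,
# resolved in favor of the better-supported advice.
lessons = [("empty cell", "treat as 0"),
           ("empty cell", "treat as 0"),
           ("empty cell", "skip row"),
           ("merged header", "unmerge first")]
skills = resolve_conflicts(lessons)
```

A real implementation would need semantic rather than exact-match contradiction checks, which is precisely the part the referee asks to see specified.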
Referee: [Experiments] Experimental results: Headline claims of up to 57.65 pp absolute gains and cross-scale/OOD transfer are stated without error bars, statistical tests, ablation isolating the hierarchical consolidation from simpler aggregation, data exclusion rules, or details on held-out trajectory pools; this leaves the support for generalizability unassessable from the reported assertions.
Authors: We acknowledge that greater statistical rigor and ablation details are needed to fully substantiate the generalizability claims. In the revised version we will report means and standard deviations across multiple runs, include paired statistical tests for the reported gains, add an ablation comparing hierarchical consolidation against simpler aggregation baselines, specify trajectory exclusion criteria (e.g., discarding failed executions), and expand the description of the held-out pools used for cross-scale and OOD evaluations. These changes will make the evidence more assessable while preserving the core experimental design. Revision: yes.
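A paired test of the kind promised here can be run with no external dependencies. This sign-flip permutation test on per-task score differences is one standard choice for paired agent evaluations, not necessarily the test the authors will adopt:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test. Under the null hypothesis
    that the two systems are equivalent, each per-task score
    difference is equally likely to carry either sign, so we compare
    the observed mean difference against the sign-flip distribution."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

With per-task accuracies from the with-skill and without-skill runs as `scores_a` and `scores_b`, a small returned p-value indicates the headline gain is unlikely under task-level exchangeability.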
Circularity Check
No circularity: skill distillation derives from external trajectories via described inductive steps and is validated on held-out/cross-model tests
Full rationale
The paper describes an empirical framework that dispatches parallel sub-agents to analyze a diverse pool of external executions, extracts trajectory-specific lessons, and applies hierarchical inductive consolidation to produce a unified skill directory. Reported performance gains (e.g., cross-scale transfer and OOD generalization on WikiTableQuestions) are measured on held-out or different-model settings rather than being produced by any internal fit, self-definition, or equation that reduces to the input trajectories by construction. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the method chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can perform reliable inductive reasoning to consolidate trajectory-specific lessons into conflict-free unified skills.
Forward citations
Cited by 12 Pith papers
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis. DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents. The SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models, such as Claude Opus 4.6, but limited or negative utility for others despite high skill usage.
- From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial? The metric Freedom (F), quantified via a Mantel test on output diversity and score variance, predicts when single-agent skill distillation from multi-agent systems will succeed, enabling up to 8x cost and 15x latency reduct...
- Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning. SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
- SkillEvolver: Skill Learning as a Meta-Skill. A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
- SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution. SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
- Evidence Over Plans: Online Trajectory Verification for Skill Distillation. PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.
- SkillGen: Verified Inference-Time Agent Skill Synthesis. SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
- ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation. ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
- Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents. The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
- Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution. Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications. The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.