SKILL-DISCO: Distilling and Compiling Agent Traces into Reusable Procedural Skills

Danrui Qi; Hanwen Gu; Peng Cheng; Yongqiang Xiong; Zhongxin Guo

arxiv: 2606.26669 · v1 · pith:K7K52EWPnew · submitted 2026-06-25 · 💻 cs.AI

Pith reviewed 2026-06-26 05:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords procedural skills · agent traces · distillation · compilation · control-flow subgraphs · ALFWorld · WebArena · FSM scenarios

The pith

SkillDisCo distills reusable parameterized control-flow subgraphs from successful agent traces and compiles them into callable procedural skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses agents that repeatedly solve similar tasks from scratch by showing how shared procedural structure can be extracted as reusable skills. It focuses on FSM-defined scenarios, where traces form paths in a transition graph, so skills can be represented as parameterized control-flow subgraphs. The SkillDisCo framework distills these subgraphs from successful traces and compiles them into executable, verifiable skills that agents can call directly. The authors report higher success rates and fewer agent turns on ALFWorld and WebArena across model scales.

Core claim

In FSM-defined scenarios, successful traces can be viewed as paths in an unknown transition graph, and procedural skills can be formulated as reusable parameterized control-flow subgraphs. SkillDisCo distills these PFSM subgraphs from traces and compiles them into callable, executable, and verifiable procedural skills. The paper reports that this improves success rates and reduces agent turns on both ALFWorld and WebArena and across model scales.
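The trace-as-path view can be made concrete with a small sketch. This is not the authors' distillation procedure, which the abstract does not describe; it is a minimal illustration under stated assumptions. The trace format `(state, action, argument)`, the toy traces, and the function names `build_transition_graph`, `parameterize`, and `shared_patterns` are all hypothetical, and the sketch matches whole traces rather than mining sub-paths.

```python
from collections import Counter, defaultdict

# Hypothetical toy traces, invented for illustration (not data from the paper).
# Each step is (state_label, action, argument).
TRACES = [
    [("start", "goto", "fridge"), ("at_loc", "open", "fridge"), ("opened", "take", "apple")],
    [("start", "goto", "drawer"), ("at_loc", "open", "drawer"), ("opened", "take", "key")],
]


def build_transition_graph(traces):
    """Union the observed paths into edge counts: (state, action) -> Counter of next states."""
    graph = defaultdict(Counter)
    for trace in traces:
        for (state, action, _), (next_state, _, _) in zip(trace, trace[1:]):
            graph[(state, action)][next_state] += 1
    return graph


def parameterize(trace):
    """Replace concrete arguments with slots; a repeated argument reuses its slot."""
    slots = {}
    abstract = []
    for state, action, arg in trace:
        if arg not in slots:
            slots[arg] = f"?x{len(slots)}"
        abstract.append((state, action, slots[arg]))
    return tuple(abstract)


def shared_patterns(traces, min_support=2):
    """Return parameterized paths that occur in at least `min_support` traces."""
    counts = Counter(parameterize(trace) for trace in traces)
    return [pattern for pattern, count in counts.items() if count >= min_support]


if __name__ == "__main__":
    for pattern in shared_patterns(TRACES):
        print(pattern)
```

Both toy traces collapse to the same slot-based path once the concrete objects are abstracted away, which is the sense in which a parameterized subgraph is reusable across task instances.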

What carries the argument

Reusable parameterized control-flow subgraphs (PFSM subgraphs) distilled from traces, which capture shared execution structure and compile into callable procedural skills.

Load-bearing premise

Successful traces can be viewed as paths in an unknown transition graph and procedural skills can be formulated as reusable parameterized control-flow subgraphs in FSM-defined scenarios.

What would settle it

A test in which SkillDisCo is applied to the same traces and models on ALFWorld and WebArena but produces no increase in success rates or decrease in agent turns.

Figures

Figures reproduced from arXiv: 2606.26669 by Danrui Qi, Hanwen Gu, Peng Cheng, Yongqiang Xiong, Zhongxin Guo.

**Figure 1.** SKILL-DISCO distills environment-adaptive PFSM skills that branch on observations and transfer across episodes and model scales without re-induction. Accompanying text: SKILL-DISCO therefore approximates PFSM-based skill discovery by recovering reusable parameterized control-flow patterns from successful traces and validating them through skill compilation. This yields a distillation-and-compilation framework.

**Figure 2.** Overview of SKILL-DISCO. The Distillation phase turns successful traces into reusable PFSM subgraphs; the Compilation phase converts them into executable and verified skills. Accompanying text: Treating the whole trace as one skill would make the induced routine too specific, while treating each primitive operator as a skill would lose the benefit of procedural reuse.

Original abstract

Agents often repeatedly solve similar task instances from scratch, leading to unnecessary reasoning cost and long execution traces. Prior work has explored workflow reuse and executable skill induction, but it remains unclear which task scenarios admit procedural skills and how the shared procedural structure should be represented across successful traces. We study this problem in FSM-defined scenarios, where successful traces can be viewed as paths in an unknown transition graph, and formulate procedural skills as reusable parameterized control-flow subgraphs. Based on this view, we introduce SkillDisCo, a distillation-and-compilation framework that distills reusable PFSM subgraphs from successful traces and compiles them into callable, executable, and verifiable procedural skills. Experiments on ALFWorld and WebArena show that SkillDisCo improves success rates and reduces agent turns across benchmarks and model scales, demonstrating the benefits of representing shared experience as reusable execution structures.
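What "callable, executable, and verifiable" could mean in practice can be sketched as follows. This is an assumption-laden illustration, not the paper's compiler: the `Skill` class, the `env_step` callback signature, the slot syntax `?x0`, and the toy environment are all hypothetical. Verification here is reduced to checking that each step starts from the state the pattern expects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (expected_state, action, slot): hypothetical encoding of one PFSM step.
Step = Tuple[str, str, str]


@dataclass
class Skill:
    """A parameterized path wrapped as a callable with per-step state checks."""
    name: str
    steps: List[Step]

    def __call__(
        self,
        env_step: Callable[[str, str], str],
        start_state: str,
        bindings: Dict[str, str],
    ) -> bool:
        state = start_state
        for expected_state, action, slot in self.steps:
            if state != expected_state:
                # Precondition violated: stop and hand control back to the agent.
                return False
            state = env_step(action, bindings[slot])
        return True


def make_toy_env():
    """Hypothetical deterministic environment that maps each action to a next state."""
    transitions = {"goto": "at_loc", "open": "opened", "take": "holding"}
    log = []

    def env_step(action: str, arg: str) -> str:
        log.append((action, arg))
        return transitions[action]

    return env_step, log


if __name__ == "__main__":
    fetch = Skill(
        name="fetch_from_container",
        steps=[("start", "goto", "?x0"), ("at_loc", "open", "?x0"), ("opened", "take", "?x1")],
    )
    env_step, log = make_toy_env()
    print(fetch(env_step, "start", {"?x0": "fridge", "?x1": "apple"}), log)
```

One call to such a skill replaces several agent turns, which is one plausible reading of how the reported reduction in turns could arise; the early return on a state mismatch is a design choice that keeps a stale skill from acting blindly.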

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SkillDisCo frames reusable skills as PFSM subgraphs distilled from traces and reports gains on two benchmarks, but the abstract supplies almost no method or analysis details.

SkillDisCo introduces a distillation-and-compilation approach that turns successful agent traces into reusable procedural skills, represented as parameterized control-flow subgraphs (PFSM subgraphs). The experiments report better success rates and fewer turns on ALFWorld and WebArena, but the abstract leaves the actual mechanics of the method unclear.

The contribution that stands out is the explicit modeling of skills as reusable PFSM subgraphs extracted from traces in FSM-defined scenarios. This gives a concrete structure for what counts as a procedural skill and how it can be compiled into something callable and verifiable. It builds on prior work in workflow reuse but adds this graph view for control flow.

The paper handles the motivation well by pointing out how agents waste effort on similar tasks and demonstrates through experiments that this reuse can help across different environments and model scales.

Where it is soft is the missing detail on the core steps. The abstract gives no description of the distillation process, of how subgraphs are selected or parameterized, or of any error analysis and controls in the experiments. This makes it difficult to tell whether the gains come from the PFSM representation or from other factors in the implementation. The stress-test point about WebArena's dynamic and partially observable nature is worth paying attention to; if the paper does not show how it handles that with a fixed transition graph, the central assumption may not hold as strongly as presented.

Readers working on agent efficiency, skill learning, or execution reuse in AI would find this relevant. It is the kind of paper that could spark ideas even if the current version needs more work on the technical exposition.

I would recommend sending it to peer review so that the full methods and results can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper claims that in FSM-defined scenarios, successful agent traces can be viewed as paths in an unknown transition graph, and formulates procedural skills as reusable parameterized control-flow subgraphs (PFSM). It introduces SkillDisCo, a distillation-and-compilation framework that extracts such subgraphs from traces and compiles them into callable, executable, and verifiable skills. Experiments on ALFWorld and WebArena report improved success rates and reduced agent turns across benchmarks and model scales.

Significance. If the central modeling assumptions and empirical gains hold, the work offers a structured approach to reusing procedural experience as verifiable execution structures rather than repeated reasoning from scratch. The PFSM representation provides a concrete, potentially falsifiable way to capture shared control-flow across task instances.

major comments (2)

§3 (Modeling): The central claim requires that successful traces admit a clean decomposition into reusable parameterized PFSM subgraphs whose parameters transfer across instances. For WebArena, dynamic page states, JavaScript execution, and partial observability make a fixed transition graph difficult to maintain without heavy abstraction; the paper must show that the distillation step reliably recovers such subgraphs rather than relying on environment-specific engineering.

§5 (Experiments): The reported gains in success rate and reduced turns are load-bearing for the claim that the PFSM representation is beneficial. The manuscript must include ablations that isolate the contribution of the parameterized subgraph representation versus alternative distillation methods or non-FSM skill induction, along with statistical significance, variance across runs, and controls for trace quality.

minor comments (1)

Abstract: The phrase 'across benchmarks and model scales' should specify the exact models, number of runs, and precise metrics (e.g., success rate deltas) to allow immediate assessment of the strength of the empirical claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Referee: §3 (Modeling): The central claim requires that successful traces admit a clean decomposition into reusable parameterized PFSM subgraphs whose parameters transfer across instances. For WebArena, dynamic page states, JavaScript execution, and partial observability make a fixed transition graph difficult to maintain without heavy abstraction; the paper must show that the distillation step reliably recovers such subgraphs rather than relying on environment-specific engineering.

Authors: The framework is explicitly scoped to FSM-defined scenarios where an underlying transition structure can be abstracted from observations. For WebArena we employ a state abstraction that encodes page elements and action effects while abstracting away transient JavaScript dynamics; the distillation procedure then operates uniformly on the resulting traces. We will revise §3 to explicitly document this abstraction, include examples of recovered parameterized subgraphs that transfer across WebArena task instances, and add a short analysis showing that the extracted subgraphs are not artifacts of bespoke engineering. revision: yes

Referee: §5 (Experiments): The reported gains in success rate and reduced turns are load-bearing for the claim that the PFSM representation is beneficial. The manuscript must include ablations that isolate the contribution of the parameterized subgraph representation versus alternative distillation methods or non-FSM skill induction, along with statistical significance, variance across runs, and controls for trace quality.

Authors: We agree that isolating the contribution of the parameterized PFSM representation and providing statistical support are necessary. In the revised §5 we will add (i) an ablation replacing parameterized subgraphs with non-parameterized and non-FSM baselines, (ii) mean and standard deviation of success rate and turn count over five independent runs per setting, (iii) paired t-tests for significance, and (iv) a control experiment that substitutes lower-quality traces. These additions will be reported for both ALFWorld and WebArena. revision: yes
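The statistical reporting promised in item (ii) and item (iii) can be sketched with the Python standard library alone. This is an illustrative helper, not code from the paper; the function name `paired_t` and its return format are assumptions. It computes the mean and sample standard deviation of per-run differences and the paired t statistic, and leaves the p-value lookup to a t table or a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t(baseline, treated):
    """Paired t statistic for matched runs (same seed or setting in both arms).

    Returns (mean_difference, std_of_differences, t_statistic, degrees_of_freedom).
    """
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    if len(baseline) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("zero variance in differences; t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat, len(diffs) - 1
```

With five runs per setting the test has 4 degrees of freedom, so the two-sided 5% critical value is about 2.776; a t statistic above that in absolute value would count as significant at that level.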

Circularity Check

0 steps flagged

No circularity: framework formulation with no equations or fitted predictions

The paper presents SkillDisCo as a distillation-and-compilation framework based on viewing traces as paths in an FSM transition graph and skills as parameterized PFSM subgraphs. This is introduced as a modeling choice in the abstract and full text, without any equations, derivations, parameter fitting, or predictions that reduce to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The experimental claims rest on empirical results rather than any self-referential reduction. This is the common case of a self-contained descriptive framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or explicit assumptions beyond the high-level framing of FSM traces and subgraphs.

pith-pipeline@v0.9.1-grok · 5683 in / 1070 out tokens · 33548 ms · 2026-06-26T05:01:13.116113+00:00 · methodology

[1] [1]

Advances in Neural Information Processing Systems , volume =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , volume =

[2] [2]

2023 , url =

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie , journal =...

2023

[3] [3]

Advances in Neural Information Processing Systems , volume =

Reflexion: Language Agents with Verbal Reinforcement Learning , author =. Advances in Neural Information Processing Systems , volume =

[4] [4]

2025 , url =

Qiu, Jieyi and Juan, Xiang and Wang, Yan and Yang, Lei and Qi, Xin and Zhang, Tao and Guo, Jun and Lu, Yun and Yao, Zheng and Wang, Mei and others , journal =. 2025 , url =

2025

[5] [5]

Toolformer: Language Models Can Teach Themselves to Use Tools

Toolformer: Language Models Can Teach Themselves to Use Tools , author =. arXiv preprint arXiv:2302.04761 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[6] [7]

Automated Design of Agentic Systems (

Hu, Shengran and Lu, Cong and Clune, Jeff , journal =. Automated Design of Agentic Systems (. 2024 , url =

2024

[7] [8]

2024 , url =

Shang, Yu and Li, Yu and Zhao, Keyu and Ma, Likai and Liu, Jiahe and Xu, Fengli and Li, Yong , journal =. 2024 , url =

2024

[8] [9]

2024 , url =

Zhang, Jiayi and Xiang, Jinyu and Yu, Zhaoyang and Teng, Fengwei and Chen, Xionghui and Chen, Jiaqi and Zhuge, Mingchen and Cheng, Xin and Hong, Sirui and others , journal =. 2024 , url =

2024

[9] [11]

and Jin, Xiaolong and Wang, Zora Zhiruo and Gandhi, Apurva and Song, Yueqi and Gu, Yu and Srinivasa, Jayanth and others , journal =

Zheng, Boyuan and Fatemi, Michael Y. and Jin, Xiaolong and Wang, Zora Zhiruo and Gandhi, Apurva and Song, Yueqi and Gu, Yu and Srinivasa, Jayanth and others , journal =. 2025 , url =

2025

[10] [12]

2023 , url =

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , url =

2023

[11] [13]

2021 , url =

Shridhar, Mohit and Yuan, Xingdi and C\^ot\'e, Marc-Alexandre and Bisk, Yonatan and Trischler, Adam and Hausknecht, Matthew , booktitle =. 2021 , url =

2021

[12] [14]

2024 , eprint =

Executable Code Actions Elicit Better LLM Agents , author =. 2024 , eprint =

2024

[13] [16]

Sutton, Richard S and Precup, Doina and Singh, Satinder , journal =. Between

[14] [19]

and Qin, Yujia and Liu, Zhiyuan and Ji, Heng , journal =

Qian, Cheng and Han, Chi and Fung, Yi R. and Qin, Yujia and Liu, Zhiyuan and Ji, Heng , journal =. 2023 , url =

2023

[15] [20]

2024 , url =

Wang, Renxi and Han, Xudong and Ji, Lei and Wang, Shu and Baldwin, Timothy and Li, Haonan , journal =. 2024 , url =

2024

[16] [21]

2023 , url =

Ruan, Jingqing and Chen, Yihong and Zhang, Bin and Xu, Zhiwei and Bao, Tianpeng and Du, Guoqing and Shi, Shiwei and Mao, Hangyu and Li, Ziyue and Zeng, Xingyu and Zhao, Rui , journal =. 2023 , url =

2023

[17] [22]

2023 , url =

Gao, Luyu and Madaan, Aman and Zhou, Shuyan and Alon, Uri and Liu, Pengfei and Yang, Yiming and Callan, Jamie and Neubig, Graham , journal =. 2023 , url =

2023

[18] [23]

IEEE International Conference on Robotics and Automation (ICRA) , pages =

Code as Policies: Language Model Programs for Embodied Control , author =. IEEE International Conference on Robotics and Automation (ICRA) , pages =

[19] [24]

2024 , url =

Xu, Yiheng and Lu, Dunjie and Shen, Zhennan and Wang, Junli and Wang, Zekun and Mao, Yuchen and Xiong, Caiming and Yu, Tao , journal =. 2024 , url =

2024

[20] [25]

arXiv preprint arXiv:2506.04287 , year =

Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback , author =. arXiv preprint arXiv:2506.04287 , year =

work page arXiv

[21] [26]

Distilling the Knowledge in a Neural Network

Distilling the Knowledge in a Neural Network , author =. arXiv preprint arXiv:1503.02531 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[22] [27]

and Liu, Muxin and Tenenbaum, Joshua B

Grand, Gabriel and Wong, Lionel and Bowers, Matthew and Olausson, Theo X. and Liu, Muxin and Tenenbaum, Joshua B. and Andreas, Jacob , journal =. 2024 , url =

2024

[23] [28]

Sarch, Gabriel and Jang, Lawrence and Tarr, Michael and Cohen, William W and Marino, Kenneth and Fragkiadaki, Katerina , journal =

[24] [29]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages =

Wang, Ruoyao and Jansen, Peter and C. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages =. 2022 , url =

2022

[25] [30]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024.

[26] [31]

Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[27] [32]

Emergent Autonomous Scientific Research Capabilities of Large Language Models. arXiv preprint arXiv:2304.05332.

[28] [33]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Conference on Robot Learning.

[29] [34]

Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, and Md Rizwan Parvez.

[30] [35]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu.

[31] [36]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024.

[32] [37]

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents. arXiv preprint arXiv:2508.02085.

[33] [38]

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. 2025.

[34] [39]

Shengda Fan, Xin Cong, Zhong Zhang, Yuepeng Fu, Yesai Wu, Hao Wang, Xinyu Zhang, Enrui Hu, and Yankai Lin. Generalizing Experience for Language Agents with Hierarchical

[35] [40]

Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, and others.

[36] [41]

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. 2026.

[37] [42]

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot.

[38] [43]

Nikolai Rozanov and Marek Rei.

[39] [44]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. 2022. In Advances in Neural Information Processing Systems.

[40] [45]

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint arXiv:2305.17126. https://arxiv.org/abs/2305.17126

[41] [46]

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435. https://arxiv.org/abs/2211.10435

[42] [47]

Shengran Hu, Cong Lu, and Jeff Clune. 2024. Automated design of agentic systems (ADAS). arXiv preprint arXiv:2408.08435. https://arxiv.org/abs/2408.08435

[43] [48]

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500.

[44] [49]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, and 3 others. 2023. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688. https://arxiv.org/abs/2308.03688

[45] [50]

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. 2026. Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158. Work in progress. https://arxiv.org/abs/2603.25158

[46] [51]

Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models. arXiv preprint arXiv:2305.14318. https://arxiv.org/abs/2305.14318

[47] [52]

Jieyi Qiu, Xiang Juan, Yan Wang, Lei Yang, Xin Qi, Tao Zhang, Jun Guo, Yun Lu, Zheng Yao, Mei Wang, and 1 others. 2025. AgentDistill: Training-free agent distillation with generalizable MCP boxes. arXiv preprint arXiv:2506.14728. https://arxiv.org/abs/2506.14728

[48] [53]

Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. 2024. AgentSquare: Automatic LLM agent search in modular design space. arXiv preprint arXiv:2410.06153. https://arxiv.org/abs/2410.06153

[49] [54]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

[50] [55]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations. https://arxiv.org/abs/2010.03768

[51] [56]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291. https://arxiv.org/abs/2305.16291

[52] [57]

Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. 2024a. ToolGen: Unified tool retrieval and calling via generation. arXiv preprint arXiv:2410.03439. https://arxiv.org/abs/2410.03439

[53] [58]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024b. Executable code actions elicit better LLM agents. In ICML. https://arxiv.org/abs/2402.01030

[54] [59]

Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. 2025. Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821. https://arxiv.org/abs/2504.06821

[55] [60]

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2024c. Agent workflow memory. arXiv preprint arXiv:2409.07429. https://arxiv.org/abs/2409.07429

[56] [61]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35.

[57] [62]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://arxiv.org/abs/2405.15793

[58] [63]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations. https://arxiv.org/abs/2210.03629

[59] [64]

Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. 2025. Multi-agent architecture search via agentic supernet. arXiv preprint arXiv:2502.04180. https://arxiv.org/abs/2502.04180

[60] [65]

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, and 1 others. 2024. AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762. https://arxiv.org/abs/2410.10762

[61] [66]

Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, and 1 others. 2025. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079. https://arxiv.org/abs/2504.07079

[62] [67]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations. https://arxiv.org/abs/2307.13854