OpenSkill: Open-World Self-Evolution for LLM Agents

Dingjie Song; Hanrong Zhang; Lichao Sun; Lifang He; Philip S. Yu; Ran Xu; Wei Liang; Xiang Li; Yutong Dai; Yuxuan Zhang

arXiv: 2606.06741 · v1 · pith:ZVU5HKAVnew · submitted 2026-06-04 · 💻 cs.AI · cs.CL · cs.LG


Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun


Pith reviewed 2026-06-28 00:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords self-evolving agents · LLM agents · open-world learning · skill synthesis · no-supervision · virtual tasks · agent adaptation · transferable skills


The pith

LLM agents can bootstrap skills and verifiers from open-world resources alone, without target-task supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agents can self-evolve after deployment even when given only a task prompt and no curated skills, trajectories, or verifier signals. It does so by pulling knowledge and verification anchors from documentation, repositories, and the web, then turning those into skills practiced on self-generated virtual tasks. This matters because most real deployments lack the labeled loops assumed by prior methods. The resulting framework reaches the highest pass rates on three benchmarks across two agents while keeping the no-supervision rule intact, and the skills move between models without retraining.

Core claim

OpenSkill acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint.

What carries the argument

Bootstrapping a self-contained evolution loop that converts open-world anchors into skills and virtual practice tasks without accessing target labels.
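
As a rough illustration of the loop described above, the sketch below wires the stages together in Python. The abstract gives no algorithm, so every name here (`Anchor`, `Skill`, `synthesize_skill`, `make_virtual_task`, `refine`) and the toy lookup logic are hypothetical stand-ins, not the authors' implementation. The only property it tries to mirror is that refinement is scored against anchors and never against target answers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Anchor:
    """A checkable fact pulled from docs, repos, or the web (hypothetical type)."""
    source: str
    query: str      # what a virtual task will ask
    expected: str   # what the anchor says the answer is


@dataclass
class Skill:
    """A skill is modeled as a simple lookup table (toy assumption)."""
    name: str
    knowledge: dict = field(default_factory=dict)

    def attempt(self, query: str) -> str:
        return self.knowledge.get(query, "unknown")


def synthesize_skill(name, anchors):
    """Build an initial skill from the first half of the anchors (toy choice)."""
    seed = anchors[: len(anchors) // 2]
    return Skill(name, {a.query: a.expected for a in seed})


def make_virtual_task(anchor):
    """A virtual task pairs a prompt with an anchor-grounded check."""
    def check(answer):
        return answer == anchor.expected
    return anchor.query, check


def refine(skill, anchors, max_rounds=3):
    """Practice on virtual tasks and patch the skill wherever a check fails."""
    for _ in range(max_rounds):
        failures = 0
        for anchor in anchors:
            prompt, check = make_virtual_task(anchor)
            if not check(skill.attempt(prompt)):
                skill.knowledge[prompt] = anchor.expected  # toy "repair" step
                failures += 1
        if failures == 0:
            break
    return skill


# Made-up anchors for illustration; not taken from the paper.
anchors = [
    Anchor("docs", "flag for verbose output", "-v"),
    Anchor("repo", "default config file", "config.yaml"),
    Anchor("web", "exit code on success", "0"),
    Anchor("docs", "flag for dry run", "--dry-run"),
]
skill = refine(synthesize_skill("cli-basics", anchors), anchors)
virtual_pass_rate = sum(
    make_virtual_task(a)[1](skill.attempt(a.query)) for a in anchors
) / len(anchors)
```

Because the toy repair step copies the anchor's answer directly, the virtual pass rate here says nothing about generalization; a real system would presumably have to be scored on held-out target tasks, as the paper reserves target supervision for final evaluation.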

If this is right

The approach attains the best automated pass rate on three benchmarks and two agents under the strict no-supervision constraint.
Skills transfer across different base models without model-specific adaptation.
The self-built verifier aligns with ground-truth outcomes despite never accessing target answers.
Open-world resources can simultaneously provide the knowledge to learn and the environment for practice.
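
One way the verifier-alignment item in the list above could be quantified is with raw agreement plus a chance-corrected statistic such as Cohen's kappa between verifier verdicts and ground-truth outcomes. The abstract does not say which metric the authors use, so the function below and its sample verdicts are illustrative assumptions only.

```python
def agreement_and_kappa(verifier, truth):
    """Return (raw agreement, Cohen's kappa) for two equal-length boolean lists."""
    if len(verifier) != len(truth) or not verifier:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(verifier)
    observed = sum(v == t for v, t in zip(verifier, truth)) / n
    p_v = sum(verifier) / n   # fraction the verifier marks as pass
    p_t = sum(truth) / n      # fraction that truly pass
    expected = p_v * p_t + (1 - p_v) * (1 - p_t)   # agreement expected by chance
    if expected == 1.0:
        # Kappa is undefined when only one class ever occurs.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Made-up verdicts for eight tasks (True = pass); not data from the paper.
verifier_verdicts = [True, True, False, True, False, False, True, True]
ground_truth = [True, True, False, False, False, True, True, True]
raw, kappa = agreement_and_kappa(verifier_verdicts, ground_truth)
```

On these made-up verdicts raw agreement is 0.75 while kappa is about 0.47, which illustrates why a chance-corrected figure is worth reporting alongside raw agreement when pass rates are skewed.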

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents could continue improving in settings where human feedback or labeled data never becomes available.
The same anchor-based virtual-task construction might apply to other agent domains where external documentation exists but task-specific labels do not.
Verification signals learned from consistency with open resources could reduce reliance on external judges in broader agent systems.

Load-bearing premise

Open-world resources supply reliable grounded knowledge and verification anchors that enable effective skill synthesis and refinement on self-built virtual tasks without introducing fatal errors or biases.

What would settle it

Removing or corrupting access to documentation, repositories, and web resources during the synthesis and virtual-task stages, then checking whether benchmark pass rates fall to the level of non-evolving baselines.
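
The test described above could be prototyped by corrupting a controlled fraction of anchors and tracking how a downstream pass rate responds. The sketch below does this on a toy lookup "skill"; the corruption scheme, the `corrupt_anchors` and `pass_rate` helpers, and the data are all hypothetical and say nothing about how OpenSkill itself behaves.

```python
import random


def corrupt_anchors(anchors, fraction, rng):
    """Return a copy of anchors with `fraction` of the answers replaced by junk."""
    noisy = dict(anchors)
    k = round(fraction * len(noisy))
    for key in rng.sample(sorted(noisy), k):
        noisy[key] = "<corrupted>"
    return noisy


def pass_rate(skill, held_out):
    """Fraction of held-out tasks whose true answer the skill reproduces."""
    hits = sum(skill.get(q) == a for q, a in held_out.items())
    return hits / len(held_out)


# Toy anchors: question -> answer. Held-out answers are used only for scoring.
clean_anchors = {f"q{i}": f"a{i}" for i in range(20)}
held_out = dict(clean_anchors)

rng = random.Random(0)  # fixed seed so the sweep is repeatable
results = {}
for fraction in (0.0, 0.25, 0.5, 1.0):
    # The toy "skill" is simply whatever the (possibly noisy) anchors say.
    skill = corrupt_anchors(clean_anchors, fraction, rng)
    results[fraction] = pass_rate(skill, held_out)
```

In this toy the pass rate falls linearly with the corruption fraction because the skill copies anchors verbatim. A system that filters or cross-checks its sources would be expected to degrade differently, and that difference is what such an experiment would measure.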

Original abstract

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenSkill frames a no-supervision agent evolution setup using open resources, but the abstract supplies no methods or checks, so the results stay untestable.


The core claim is that an agent can bootstrap skills and its own verifier from docs, repos, and the web, then refine on self-built virtual tasks without ever seeing target answers. That strict constraint is the main new element relative to earlier agent evolution work that still leaned on some form of trajectory or signal.

The paper states the problem cleanly: real deployments often give only a prompt, so any method needing curated data or external verifiers is blocked. It also reports that the resulting skills transfer across models and that the self-verifier matches ground truth on three benchmarks.

The soft spot is exactly the one the stress-test flags. Open-world sources routinely contain outdated APIs, contradictory examples, and outright errors. The abstract gives no mechanism for filtering those before they shape the skills or the verifier, yet still claims alignment with ground truth. Without any methods section, error analysis, or example traces, there is no way to judge whether the reported pass rates reflect genuine progress or just propagated noise. The circularity concern is also live: if the verifier is built from the same noisy anchors, its agreement with held-out answers could be coincidental rather than evidence of correctness.

This is for people already working on autonomous LLM agents who want to think about the zero-supervision limit case. A reader could pull the framing for discussion, but the current write-up does not yet give enough to replicate or refute the results.

I would not send it for peer review yet. The evidence gap is too large; the full paper would need concrete algorithms, ablation on source quality, and verification that the self-verifier is not just fitting to its own construction process.

Referee Report

2 major / 1 minor

Summary. The paper introduces OpenSkill, a framework for open-world self-evolution of LLM agents that bootstraps skills and verification signals from open-world resources (documentation, repositories, web) without target supervision. It synthesizes transferable skills and refines them on self-built virtual tasks. The framework is evaluated on three benchmarks with two target agents, claiming the best automated pass rate under the no-supervision constraint, with skills transferring across models and the self-built verifier aligning with ground truth without accessing it.

Significance. If the results hold under rigorous verification, this work addresses a key gap in autonomous agent adaptation for real deployments lacking curated signals. The no-supervision constraint satisfaction, cross-model transfer, and use of open resources for both knowledge and practice environments represent a substantive advance if the evidence is detailed and reproducible.

major comments (2)

[Abstract] The claim that the self-built verifier aligns with ground-truth outcomes despite never accessing them is central to the no-supervision contribution, yet no quantitative alignment metric, agreement rate, or comparison to baselines is supplied; this must be reported explicitly in the results section with controls for how alignment was computed across the three benchmarks.

[Abstract] The premise that open-world resources reliably supply grounded verification anchors for skill synthesis and refinement is load-bearing for the entire loop, but the manuscript provides no robustness analysis against documented inaccuracies, version conflicts, or incomplete APIs in such sources; experiments introducing controlled noise into the anchors should be added to the evaluation to test whether errors propagate into the reported pass rates.

minor comments (1)

[Abstract] The phrase 'automated pass rate' should be defined on first use or cross-referenced to the evaluation protocol, as its precise computation is not self-evident from the high-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our no-supervision claims. We address each major comment point by point below.


Referee: [Abstract] The claim that the self-built verifier aligns with ground-truth outcomes despite never accessing them is central to the no-supervision contribution, yet no quantitative alignment metric, agreement rate, or comparison to baselines is supplied; this must be reported explicitly in the results section with controls for how alignment was computed across the three benchmarks.

Authors: We agree that explicit quantitative metrics are required to support the alignment claim. Although the manuscript references analysis showing alignment, no agreement rates or computation details appear in the results. In the revision we will add a dedicated subsection reporting agreement metrics (e.g., percentage agreement and correlation) across all three benchmarks, with explicit controls for how alignment is measured and comparisons to simple baseline verifiers.

Revision: yes

Referee: [Abstract] The premise that open-world resources reliably supply grounded verification anchors for skill synthesis and refinement is load-bearing for the entire loop, but the manuscript provides no robustness analysis against documented inaccuracies, version conflicts, or incomplete APIs in such sources; experiments introducing controlled noise into the anchors should be added to the evaluation to test whether errors propagate into the reported pass rates.

Authors: We acknowledge that the manuscript contains no robustness analysis against inaccuracies in open-world sources. To address this gap we will add controlled-noise experiments that inject simulated version conflicts, incomplete APIs, and factual errors into the anchors and measure downstream effects on skill quality and pass rates. These results will be reported in the evaluation section of the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity; framework claims rest on external empirical evaluation


The paper describes an empirical framework for bootstrapping agent skills and verifiers from open-world resources, with no equations, derivations, or first-principles reductions present. Performance claims are supported by benchmark pass rates and post-hoc alignment analysis against held-out ground truth, not by any fitted parameter renamed as prediction or by self-citation chains. The no-supervision constraint and verifier alignment are evaluated externally on target tasks, rendering the reported results independent of the method's internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5755 in / 1332 out tokens · 40153 ms · 2026-06-28T00:44:28.859823+00:00 · methodology


