SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

Ivor W. Tsang; Pengfei Zhou; Tong Bai; Wangbo Zhao; Xingrui Yu; Yang You; Zhenglin Wan

arxiv: 2606.03056 · v1 · pith:KABWFSIMnew · submitted 2026-06-02 · 💻 cs.AI

Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang

Pith reviewed 2026-06-28 10:33 UTC · model grok-4.3

keywords: skill selection, LLM agents, directed graphs, self-evolving systems, structural retrieval, agent planning, benchmark evaluation

The pith

LLM agents improve skill selection by maintaining an evolving typed directed graph of inter-skill relations that they query and update during execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that as skill libraries scale, selecting the right skills requires modeling their dependencies, conflicts, and specializations as a typed directed graph rather than relying on embedding similarity alone. The graph is exposed to the agent as a callable interface for structural retrieval, returning neighbors and conflict signals, and it evolves through a propose-then-commit protocol in which the agent registers edges backed by execution outcomes. A sympathetic reader would care because, per the abstract, candidate ranking stays robust as the pool grows tenfold while a fixed seeding-diffusion pipeline degrades, and the authors tie this to higher success and reward on ALFWorld and SkillsBench.

Core claim

SkillDAG models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface that is queried and evolved during execution, with each search returning vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol allowing the agent to register execution-backed edges so the graph accumulates structure across episodes.
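To make the three-part search result concrete, here is a minimal sketch of a typed skill graph whose search returns vector matches, typed-edge neighbors, and conflict signals. It is an illustration under stated assumptions, not the paper's implementation: the class `SkillGraph`, the `search` signature, the toy two-dimensional embeddings, and the skill names are all hypothetical, and the four edge types are taken from the referee summary further down this page.

```python
import math
from dataclasses import dataclass, field
from enum import Enum


class EdgeType(Enum):
    DEPENDS_ON = "depends-on"
    CONFLICTS_WITH = "conflicts-with"
    SPECIALIZES = "specializes"
    DUPLICATES = "duplicates"


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SkillGraph:
    # Hypothetical container: names and layout are assumptions for illustration.
    embeddings: dict = field(default_factory=dict)  # skill name -> toy vector
    edges: set = field(default_factory=set)         # (source, EdgeType, target)

    def search(self, query_vec, k=1):
        """Return vector matches, typed-edge neighbors, and conflict signals."""
        ranked = sorted(self.embeddings,
                        key=lambda s: cosine(query_vec, self.embeddings[s]),
                        reverse=True)
        matches = ranked[:k]
        # Sort edges so the output order is deterministic.
        ordered = sorted(self.edges, key=lambda e: (e[0], e[1].value, e[2]))
        neighbors = [(s, t.value, d) for s, t, d in ordered
                     if s in matches and t is not EdgeType.CONFLICTS_WITH]
        conflicts = [(s, d) for s, t, d in ordered
                     if t is EdgeType.CONFLICTS_WITH and (s in matches or d in matches)]
        return {"matches": matches, "neighbors": neighbors, "conflicts": conflicts}


graph = SkillGraph(
    embeddings={"open_drawer": [1.0, 0.0], "find_key": [0.9, 0.1],
                "heat_food": [0.0, 1.0], "cool_food": [0.1, 0.9]},
    edges={("open_drawer", EdgeType.DEPENDS_ON, "find_key"),
           ("heat_food", EdgeType.CONFLICTS_WITH, "cool_food")},
)
result = graph.search([1.0, 0.0], k=1)
```

In this toy setup the query matches `open_drawer` and surfaces its `depends-on` neighbor `find_key`, which pure embedding ranking at k=1 would have left out.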

What carries the argument

The typed directed graph used as a structural retrieval interface that the agent queries and evolves via propose-then-commit edge registration based on execution outcomes.
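The abstract does not spell out how propose-then-commit works, so the following is only one plausible reading, sketched under loud assumptions: edges are staged as proposals, only successful executions count as evidence, and a proposal is committed once it reaches a support threshold. The names `EdgeLedger`, `propose`, `commit`, and the `min_support` threshold are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeLedger:
    # Hypothetical threshold: number of successful episodes needed to commit.
    min_support: int = 2
    pending: dict = field(default_factory=dict)   # edge -> support count
    committed: set = field(default_factory=set)   # accepted edges

    def propose(self, edge, episode_succeeded):
        """Stage an edge; only successful executions count as evidence."""
        if edge in self.committed or not episode_succeeded:
            return
        self.pending[edge] = self.pending.get(edge, 0) + 1

    def commit(self):
        """Promote sufficiently supported proposals; committed edges are never removed."""
        ready = {e for e, n in self.pending.items() if n >= self.min_support}
        for e in ready:
            del self.pending[e]
        self.committed |= ready
        return ready


ledger = EdgeLedger()
edge = ("open_drawer", "depends-on", "find_key")  # (source, type, target)
ledger.propose(edge, episode_succeeded=True)
first = ledger.commit()      # one supporting episode: nothing committed yet
ledger.propose(edge, episode_succeeded=False)  # failed episode is ignored
ledger.propose(edge, episode_succeeded=True)
second = ledger.commit()     # second supporting episode: edge is committed
```

Separating the staging step from the commit step is what would give a system a place to reject noisy proposals; whether SkillDAG gates commits this way is exactly what the referee report below asks the authors to document.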

If this is right

Candidate ranking stays robust as the skill pool grows tenfold.
Set-monotone online edits enlarge ground-truth recall without evicting prior hits.
The performance gains transfer to different underlying LLMs.
Intrinsic retrieval quality measured by Ret@K improves under matched queries.
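The set-monotone property in the second item can be illustrated with a small sketch. Two things here are assumptions rather than facts from the paper: Ret@K is treated as a recall-style score (the fraction of ground-truth skills found in the top K candidates), which the abstract does not define, and `monotone_extend` is a hypothetical helper that models an online edit as appending candidates without evicting earlier ones.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Assumed recall-style metric: share of ground-truth skills in the top k."""
    if not ground_truth:
        return 0.0
    return len(set(retrieved[:k]) & set(ground_truth)) / len(ground_truth)


def monotone_extend(retrieved, additions):
    """Append new candidates after the existing ones; nothing is evicted."""
    seen = set(retrieved)
    extended = list(retrieved)
    for skill in additions:
        if skill not in seen:
            seen.add(skill)
            extended.append(skill)
    return extended


ground_truth = {"a", "c"}
before = ["a", "b"]
after = monotone_extend(before, ["c", "a"])

hits_before = set(before) & ground_truth
hits_after = set(after) & ground_truth
recall_before = recall_at_k(before, ground_truth, k=3)
recall_after = recall_at_k(after, ground_truth, k=3)
```

Because the edit only appends, every earlier hit is still present afterwards, so recall over the full candidate set cannot decrease. With a hard cutoff at K the guarantee holds only while the extended list still fits inside K.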

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such graphs could allow agents to discover and exploit higher-order skill compositions over time.
The approach might extend to domains with other relational structures, like tool dependencies in software engineering.
Error propagation in edge registration could be mitigated by periodic validation mechanisms not described in the work.

Load-bearing premise

Execution outcomes reliably produce accurate typed edges without systematic bias that would degrade future retrieval quality.

What would settle it

A controlled experiment where the evolving graph shows no advantage over a static graph baseline on tasks requiring conflict resolution in a large skill pool would falsify the central claim.
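One way to operationalize "no advantage" is a paired comparison over the same tasks, run once with the evolving graph and once with a static graph. The sketch below uses an exact two-sided sign test on discordant task outcomes. The choice of test, the function name `sign_test_p`, and the toy outcome lists are assumptions made for illustration; the paper does not describe such an experiment.

```python
from math import comb


def sign_test_p(evolving, static):
    """Exact two-sided sign test on paired per-task success flags."""
    wins = sum(e and not s for e, s in zip(evolving, static))
    losses = sum(s and not e for e, s in zip(evolving, static))
    n = wins + losses  # tasks where the two systems disagree
    if n == 0:
        return 1.0
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes on eight conflict-resolution tasks.
evolving_ok = [True] * 6 + [False] * 2
static_ok = [False] * 8
p_value = sign_test_p(evolving_ok, static_ok)  # 6 wins, 0 losses -> 2/64
```

A large p-value on tasks that require conflict resolution in a large skill pool would be the outcome that counts against the central claim.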

Original abstract

As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
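The headline numbers in the abstract imply baseline scores that are not stated directly. A short arithmetic check, assuming the reported gains are absolute percentage-point differences on the same metric, recovers them and confirms the Ret@K gain quoted later in the referee report. The variable names are illustrative.

```python
# (SkillDAG score, reported gain over the strongest Graph-of-Skills baseline)
reported = {"success": (67.1, 12.8), "reward": (27.3, 8.6)}

# Implied baseline = score minus gain, assuming gains are percentage points.
implied_baseline = {name: round(score - gain, 1)
                    for name, (score, gain) in reported.items()}

# Ret@K rises from 65.5 to 78.2 under matched queries.
ret_gain = round(78.2 - 65.5, 1)

print(implied_baseline)  # success 54.3, reward 18.7
print(ret_gain)          # 12.7
```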

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkillDAG adds an inference-time typed graph with propose-then-commit updates for skill selection, but the performance claims lack the experimental details needed to judge them.

The main takeaway is that SkillDAG builds a typed directed graph of skills (depends-on, conflicts-with, etc.) and lets the LLM agent query it structurally at runtime while also proposing and committing new edges from execution results. This moves beyond fixed Graph-of-Skills setups by making the structure both queryable and updatable during episodes.

The approach is new in combining the agent-callable interface with the online evolution protocol, and the paper shows the expected practical benefit: candidate ranking stays stable as the skill pool grows tenfold, and the monotone edits improve recall without dropping prior hits. The reported numbers on ALFWorld and SkillsBench (67.1% success, +12.8 over the strongest baseline) are the kind of engineering signal that would matter to people scaling agent tool libraries.

The soft spot is the evaluation. The abstract attributes the gains to the ranking and edit mechanisms but gives no protocol, baseline code, ablation tables, or statistical checks. Without those, it is impossible to tell whether the propose-then-commit step actually produces clean edges or whether noisy or partial execution traces introduce systematic errors that later queries then amplify. The stress-test worry about cumulative degradation from bad edges is plausible on the current description and needs direct evidence to dismiss.

This is for researchers working on LLM agents that must pick from growing skill sets rather than for core theory audiences. A reader who already runs large tool libraries might want to try the retrieval interface even if the evolution part stays unproven.

It deserves peer review so the experimental section can be examined for controls on edge quality and for reproducibility of the reported margins.

Referee Report

3 major / 1 minor

Summary. SkillDAG models inter-skill relationships in large LLM skill libraries as a typed directed graph (depends-on, conflicts-with, specializes, duplicates) and exposes it via an inference-time structural retrieval interface. An LLM agent queries vector matches plus typed neighbors and conflict signals during execution; a propose-then-commit protocol registers execution-backed edges so the graph evolves across episodes. On ALFWorld and SkillsBench the method reports 67.1% success / 27.3% reward with MiniMax-M2.7 (exceeding the strongest Graph-of-Skills baseline by +12.8 / +8.6 points), with the advantage transferring to gpt-5.2-codex and intrinsic Ret@K rising from 65.5 to 78.2; gains are attributed to ranking robustness under 10x pool growth and set-monotone online edits.

Significance. If the performance numbers and attribution to structural retrieval plus self-evolution hold under rigorous controls, the work would be significant for scaling agent skill selection beyond embedding similarity. The self-evolving typed graph and agent-callable interface address a genuine structural gap; the reported robustness to pool growth and set-monotone property would be valuable if independently verified. However, the absence of protocol, baseline, and ablation details in the manuscript prevents assessing whether these gains are reproducible or artifactual.

major comments (3)

[Abstract] Abstract: the headline gains (+12.8 success, +8.6 reward, +12.7 Ret@K) are attributed to candidate ranking that remains robust at 10x pool size and to set-monotone online edits, yet the manuscript supplies no experimental protocol, baseline implementations, statistical tests, or ablation tables that would allow verification of this attribution.
[Abstract] Abstract (propose-then-commit protocol): the central mechanism registers directed typed edges from agent execution outcomes, but no description is given of how edge accuracy is validated, how partial observability or spurious correlations are handled, or what safeguards prevent error propagation that would degrade subsequent retrieval quality—the exact risk highlighted by the weakest assumption.
[Abstract] Abstract: the claim that the graph is updated from execution outcomes on external benchmarks provides grounding, yet the success metric itself depends on the evolving graph, creating a circularity that is not addressed by any reported control experiment or independent validation set.

minor comments (1)

[Abstract] The abstract mentions 'intrinsic SkillsBench Ret@K' without defining the metric or the query-matching procedure used to compute it.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback highlighting the need for greater transparency in experimental details and controls. We will revise the manuscript to incorporate the suggested additions and clarifications.

Referee: [Abstract] Abstract: the headline gains (+12.8 success, +8.6 reward, +12.7 Ret@K) are attributed to candidate ranking that remains robust at 10x pool size and to set-monotone online edits, yet the manuscript supplies no experimental protocol, baseline implementations, statistical tests, or ablation tables that would allow verification of this attribution.

Authors: We agree with the referee that additional details are necessary to allow verification of the attribution. The revised manuscript will include a comprehensive experimental protocol section, descriptions of baseline implementations, results of statistical tests, and ablation tables that isolate the contributions of the structural retrieval and self-evolution mechanisms. revision: yes

Referee: [Abstract] Abstract (propose-then-commit protocol): the central mechanism registers directed typed edges from agent execution outcomes, but no description is given of how edge accuracy is validated, how partial observability or spurious correlations are handled, or what safeguards prevent error propagation that would degrade subsequent retrieval quality—the exact risk highlighted by the weakest assumption.

Authors: We will provide a detailed description of the propose-then-commit protocol in the revised manuscript, including how edge accuracy is validated, handling of partial observability or spurious correlations, and safeguards to prevent error propagation. revision: yes

Referee: [Abstract] Abstract: the claim that the graph is updated from execution outcomes on external benchmarks provides grounding, yet the success metric itself depends on the evolving graph, creating a circularity that is not addressed by any reported control experiment or independent validation set.

Authors: This is a valid concern regarding potential circularity. We will add control experiments that use a static graph version and report results on an independent validation set to demonstrate that the performance gains are attributable to the proposed approach rather than the dependency in the metric. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results grounded in external benchmarks

The paper describes SkillDAG as an empirical system whose graph evolves via propose-then-commit from agent execution traces on ALFWorld and SkillsBench. Reported metrics (success rate, reward, Ret@K) are measured directly on those same external task distributions against fixed baselines. No equations, self-citations, or derivations are shown that reduce a claimed prediction or uniqueness result to a fitted input or prior self-work by construction. The method is therefore self-contained against the stated benchmarks; any concerns about edge accuracy or error propagation belong to correctness risk rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that inter-skill relationships are stable enough to be captured by typed edges and that agent execution provides reliable feedback for edge registration. No free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption: Inter-skill relationships can be represented as typed directed edges that remain useful across episodes.
Invoked by the design of the graph and the propose-then-commit protocol.

invented entities (1)

Typed skill graph with propose-then-commit updates (no independent evidence)
purpose: To expose structural retrieval and accumulate execution-backed edges at inference time.
Core modeling choice introduced by the paper; no independent falsifiable evidence supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1353 out tokens · 32185 ms · 2026-06-28T10:33:11.812778+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages

[1] Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, and Leon Xu. CUA-Skill: Develop skills for computer using agent. arXiv preprint arXiv:2601.21123.

[2] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.

[3] URL https://arxiv.org/abs/2601.04786. Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. AutoGuide: Automated generation and selection of context-aware guidelines for large language model agents. In Advances in Neural Information Processing Systems (NeurIPS).

[4] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, and Mengdi Wang. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence.

[5] Guanyu Jiang, Zhaochen Su, Xiaoye Qu, and Yi R. Fung. XSkill: Continual learning from experience and skills in multimodal agents. arXiv preprint arXiv:2603.12056.

[6] URL https://arxiv.org/abs/2508.01415. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.

[7] Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, and Mengru Wang. SkillNet: Create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448.

[8] Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of skills: Dependency-aware structural retrieval for massive agent skills. arXiv preprint arXiv:2604.05333.

[9] Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. doi: 10.1109/TPAMI.2018.2889473. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In International Conference on Learning Representations (ICLR).

[10] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.

[11] doi: 10.52202/079017-4020. Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In International Conference on Machine Learning (ICML).

[12] URL https://arxiv.org/abs/2601.01569. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese bert-networks. In EMNLP.

[13] Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, and Zhaochun Ren. Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models. arXiv preprint arXiv:2503.01763.

[14] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In ICML, 2024b. Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024c. Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. JARVI...

[15] Tianle Xia, Lingxiang Hu, Yiding Sun, Ming Xu, Lan Xu, Siying Wang, Wei Xu, and Jie Jiang. GraSP: Graph-structured skill compositions for LLM agents. arXiv preprint arXiv:2604.17870.

[16] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023a. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2...
