Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

Cedric Lim; I. Danial Bin Sharudin; Laura Wynter; M. Danish Lim; Wen Han Chen

arxiv: 2606.06923 · v1 · pith:QZZ6PXAEnew · submitted 2026-06-05 · 💻 cs.AI · cs.SE

Pith reviewed 2026-06-27 21:57 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords: declarative agents · tool use · orchestration · knowledge-grounded workflows · AI agents · state machines · retrieval quality · procedural tasks

The pith

Declarative skill files improve AI agent accuracy on procedural tasks and reduce orchestration errors when retrieval quality is high.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether AI agents that read natural-language skill files can orchestrate tool use in customer-service workflows over unstructured knowledge bases better than agents driven by explicit programmatic state machines. It compares three designs on five language models across two retrieval regimes and finds retrieval quality to be the dominant performance limiter. When retrieval is reliable, the declarative approach yields measurable gains in task accuracy and fewer control-flow mistakes while the imperative design shows no consistent benefit.

Core claim

Declarative agents that append three domain-specific natural-language skill files to the system prompt and decide their own control flow outperform both an imperative state-machine agent and an unscaffolded baseline on procedural accuracy and orchestration compliance once retrieval supplies complete evidence; all three agents degrade sharply when evidence is incomplete or skewed.

What carries the argument

Declarative skill files: natural-language descriptions of domain skills that the agent reads at inference time to generate its own control flow.
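A minimal sketch of how skill files might be appended to a system prompt, assuming each skill file is available as a plain-text string. The function name `build_system_prompt`, the `<skill>` wrapper, and the three example skill texts are hypothetical illustrations, not the paper's actual files or format.

```python
from typing import Mapping


def build_system_prompt(base_prompt: str, skills: Mapping[str, str]) -> str:
    """Append each skill file's text to the base system prompt.

    The <skill> wrapper is an illustrative convention, not the paper's format.
    """
    sections = [base_prompt.strip()]
    for name, text in skills.items():
        sections.append(f'<skill name="{name}">\n{text.strip()}\n</skill>')
    return "\n\n".join(sections)


# Hypothetical skill texts; the paper uses three domain-specific files.
skills = {
    "kb_search": "Search the knowledge base before calling any state-changing tool.",
    "procedures": "Follow the documented procedure steps in order.",
    "tool_use": "Use only tool names that appear in retrieved documents.",
}
prompt = build_system_prompt("You are a customer-service agent.", skills)
```

In this reading the model, not the surrounding code, decides which skill applies at each turn; the code only concatenates text.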

If this is right

High-quality retrieval is required before any orchestration method can deliver reliable gains.
Declarative skill files reduce the need for hand-coded phase logic in procedural workflows.
Imperative state machines add structural complexity without delivering proportional reliability improvements.
Skill-file performance is tied to the completeness of the underlying knowledge base rather than to model scale alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Natural-language skill descriptions may scale more easily across new domains than hand-written state machines.
The same retrieval-quality bottleneck would likely appear in any tool-use setting that depends on external unstructured data.
Testing whether skill files remain effective when the underlying model is smaller or when workflows become longer could reveal limits on the declarative approach.

Load-bearing premise

The three agent designs represent meaningfully distinct orchestration paradigms whose performance gaps can be isolated from differences in prompt engineering or model behavior.

What would settle it

A controlled run in which declarative agents show no accuracy or compliance advantage over the imperative state machine even when retrieval returns every relevant document without skew.

Figures

Figures reproduced from arXiv: 2606.06923 by Cedric Lim, I. Danial Bin Sharudin, Laura Wynter, M. Danish Lim, Wen Han Chen.

**Figure 2.** The state graph of the ImperativeAgent. […] return a result. As the benchmark tool names contain random four-digit suffixes (e.g. close_bank_account_7392) that cannot be guessed, this skill file emphasises that KB search is a hard precondition for state-changing tools. The behaviour of our DeclarativeAgent proceeds as follows. At each turn, the DeclarativeAgent appends the incoming message to its state, concate…

**Figure 3.** Pass^1 versus Cost/Task across all 30 (5 models × 3 agents × 2 retrieval types) combinations. Filled symbols are golden retrieval; empty symbols are embedding retrieval. The DeclarativeAgent points (squares) form an upper envelope of the golden frontier except for Gemini-FlashLite. Embedding-retrieval points (empty shapes) and ImperativeAgent points (triangles) lie well inside the envelope.
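The 30 combinations in Figure 3 and one possible reading of "upper envelope" (a cost/accuracy Pareto frontier) can be sketched as follows. The model and agent labels are generic placeholders, the function `upper_envelope` is a hypothetical helper, and the demo points are made-up numbers for illustration only, not results from the paper.

```python
from itertools import product

# Placeholder labels: generic names stand in for the five evaluated models.
models = [f"model_{i}" for i in range(1, 6)]
agents = ["baseline", "imperative", "declarative"]
retrieval = ["golden", "embedding"]

# Every (model, agent, retrieval) combination: 5 x 3 x 2.
grid = list(product(models, agents, retrieval))


def upper_envelope(points):
    """Keep points that no cheaper-or-equal-cost point matches or beats.

    Each point is (cost_per_task, pass_rate); the result is sorted by cost.
    """
    frontier = []
    best = float("-inf")
    # Visit cheapest points first; at equal cost, the higher score first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best:
            frontier.append((cost, score))
            best = score
    return frontier


# Made-up numbers for illustration only; they are not results from the paper.
demo = [(0.10, 0.30), (0.20, 0.25), (0.30, 0.50), (0.15, 0.40)]
envelope = upper_envelope(demo)
```

Here the point (0.20, 0.25) falls inside the envelope because a cheaper point already scores higher, which is the sense in which the caption places some configurations "well inside".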

Original abstract

We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents -- AI agents equipped with natural-language skill files appended to the system prompt -- are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $\tau$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine's brittleness does not reliably improve task success or compliance.
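For contrast with the declarative design, the following is a rough sketch of what "a programmatic state machine with explicit phases" could look like. Only the EXECUTION phase name appears in the source; the RETRIEVAL and DONE phases, the action strings, and the tool name are hypothetical, and the paper's actual state graph may differ.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    RETRIEVAL = auto()  # hypothetical phase name
    EXECUTION = auto()  # phase name mentioned in the source
    DONE = auto()       # hypothetical phase name


@dataclass
class AgentState:
    phase: Phase = Phase.RETRIEVAL
    kb_searched: bool = False
    log: list = field(default_factory=list)


def step(state: AgentState, action: str) -> AgentState:
    """Advance the state machine; reject actions not allowed in the current phase."""
    if state.phase is Phase.RETRIEVAL:
        # The code, not the model, enforces that KB search comes first.
        if action != "kb_search":
            raise ValueError("KB search must precede state-changing tools")
        state.kb_searched = True
        state.phase = Phase.EXECUTION
    elif state.phase is Phase.EXECUTION:
        if action == "finish":
            state.phase = Phase.DONE
    else:
        raise ValueError("episode already finished")
    state.log.append(action)
    return state


state = AgentState()
step(state, "kb_search")
step(state, "close_account")  # hypothetical tool name
step(state, "finish")
```

The design point is where the ordering rule lives: here it is hard-coded in `step`, whereas a declarative agent would read the same rule as natural-language text and apply it itself.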

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows declarative skill files can beat state machines on procedural tasks when retrieval is strong, but the gains may come from better prompt detail rather than the declarative/imperative split itself.

The key point here is that declarative natural-language skill files improve accuracy and cut orchestration errors compared to an imperative state machine when retrieval is high-quality, while all approaches collapse under poor retrieval. That matches the abstract's main empirical message.

What stands out as new is the direct head-to-head on five models across two retrieval regimes, plus the Dec-POMDP framing that treats the three designs as distinct policy classes. The work does a clean job of isolating retrieval quality as the dominant factor and showing that skill files help on procedural customer-service tasks where the baseline and state machine fall short.

The soft spot is the one the stress-test note flags: nothing in the abstract confirms that prompt length, specificity, or tuning were matched across the DeclarativeAgent (three appended files), the ImperativeAgent (programmatic phases), and the unscaffolded baseline. If the declarative version simply supplies more explicit guidance in natural language while the state machine uses a brittle encoding, the measured edge could be prompt engineering rather than the control-flow paradigm. The lack of reported metrics, error bars, or statistical tests in the abstract makes it hard to judge effect sizes or robustness.

This is useful reading for people building agents for knowledge-grounded workflows who want a practical comparison of orchestration styles. It is not reshaping the broader field, but the targeted contrast and the retrieval bottleneck result are worth a referee's time. I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper claims that declarative agents equipped with natural-language skill files appended to the system prompt are an effective orchestration paradigm for tool-using AI agents in customer-service workflows over unstructured knowledge bases. It compares three designs—DeclarativeAgent (three domain-specific skill files), ImperativeAgent (programmatic state machine with explicit phases), and an unscaffolded baseline—formalized as distinct policy classes in a decentralised partially-observable Markov decision process (Dec-POMDP). The work analyzes their information-theoretic and structural properties and reports empirical results on five language models across two retrieval regimes, concluding that retrieval quality is the dominant bottleneck but that, under high-quality retrieval, declarative skills improve accuracy on procedural tasks and reduce orchestration errors while the imperative state machine does not.

Significance. If the central empirical claim holds after proper controls, the work would offer a useful comparison of orchestration mechanisms grounded in a Dec-POMDP formalization, with potential implications for designing more robust tool-use agents. The emphasis on retrieval quality as a bottleneck is a valuable practical observation, and the distinction between declarative and imperative control could inform future agent scaffolding if the designs are shown to be isolated from confounds.

major comments (3)

[abstract and agent design section] Agent design descriptions (abstract and §3): the three agents are presented as representing distinct orchestration paradigms, yet no evidence is supplied that prompt length, level of detail, or explicitness of procedural guidance were matched; the DeclarativeAgent appends three skill files while the ImperativeAgent uses a programmatic encoding, so any measured advantage could be attributable to prompt quality rather than the declarative/imperative distinction formalized in the Dec-POMDP policy classes.
[abstract and results section] Empirical results (abstract and results section): the claimed improvements in accuracy and reduction in orchestration errors under high-quality retrieval are summarized without reporting concrete metrics, statistical tests, error bars, or per-model/per-regime breakdowns, making it impossible to assess whether the differences are reliable or driven by the intended variable.
[§4] Dec-POMDP formalization (§4): the information-theoretic and structural properties derived for the three policy classes are not shown to predict or explain the specific empirical pattern that declarative skills succeed where the imperative state machine fails; the formal analysis therefore appears disconnected from the load-bearing empirical claim.

minor comments (2)

[abstract] The abstract introduces terms such as 'orchestration errors' and 'compliance' without brief definitions, which would aid readability for readers outside the immediate subfield.
[§4] Notation for the Dec-POMDP components (e.g., observation and action spaces for each policy class) could be presented more explicitly in a single table or equation block for easier cross-reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and outline revisions to improve clarity and rigor.

Referee: [abstract and agent design section] Agent design descriptions (abstract and §3): the three agents are presented as representing distinct orchestration paradigms, yet no evidence is supplied that prompt length, level of detail, or explicitness of procedural guidance were matched; the DeclarativeAgent appends three skill files while the ImperativeAgent uses a programmatic encoding, so any measured advantage could be attributable to prompt quality rather than the declarative/imperative distinction formalized in the Dec-POMDP policy classes.

Authors: We agree that the manuscript does not report or control for prompt length, level of detail, or explicitness of procedural guidance across agents. The designs intentionally contrast natural-language declarative skill files against a programmatic imperative state machine, as formalized in the Dec-POMDP policy classes. To address potential confounds, we will add to §3 a comparison of system-prompt token counts, skill-file content, and state-machine code structure, plus an appendix with full prompt examples. This will enable readers to evaluate whether differences arise from the declarative/imperative distinction or from prompt characteristics, while preserving the core paradigm comparison. revision: partial
Referee: [abstract and results section] Empirical results (abstract and results section): the claimed improvements in accuracy and reduction in orchestration errors under high-quality retrieval are summarized without reporting concrete metrics, statistical tests, error bars, or per-model/per-regime breakdowns, making it impossible to assess whether the differences are reliable or driven by the intended variable.

Authors: The results section already contains per-model and per-regime accuracy tables and orchestration-error breakdowns. However, the abstract and high-level claims omit specific numbers, statistical tests, and error bars. We will revise the abstract to report key quantitative results (e.g., accuracy deltas under high-quality retrieval) and add error bars plus notes on statistical tests (paired comparisons across models) to the figures and text in §5. These changes will make the empirical claims directly verifiable. revision: yes
Referee: [§4] Dec-POMDP formalization (§4): the information-theoretic and structural properties derived for the three policy classes are not shown to predict or explain the specific empirical pattern that declarative skills succeed where the imperative state machine fails; the formal analysis therefore appears disconnected from the load-bearing empirical claim.

Authors: The information-theoretic analysis derives policy entropy and observability properties intended to explain why declarative policies reduce orchestration complexity. We acknowledge that the manuscript does not explicitly connect these properties to the observed pattern (declarative success versus imperative failure). We will add a bridging paragraph in §4.3 that maps the formal results (e.g., lower state-space entropy in declarative policies) to the empirical reductions in orchestration errors under high-quality retrieval, thereby tightening the link between theory and experiment. revision: yes
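The "paired comparisons across models" mentioned in the second response could, for example, be run as an exact sign-flip test on per-model accuracy differences. This is only a sketch of one such test, not the authors' stated procedure; the function name `sign_flip_p_value` is hypothetical and the differences below are placeholder values, not numbers from the paper.

```python
from itertools import product


def sign_flip_p_value(deltas):
    """Exact two-sided sign-flip test on paired differences.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so all 2**n sign patterns are enumerated.
    """
    observed = abs(sum(deltas))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(deltas)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, deltas)))
        # Small tolerance guards against floating-point rounding.
        if stat >= observed - 1e-12:
            count += 1
    return count / total


# Placeholder per-model accuracy differences (declarative minus imperative);
# these are not the paper's numbers.
deltas = [0.05, 0.03, 0.04, 0.02, 0.06]
p = sign_flip_p_value(deltas)
```

With only five models there are 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625, even when every model favours the same agent. A per-task pairing would give such a test more resolution than a per-model one.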

Circularity Check

0 steps flagged

No significant circularity; formalization and empirical tests are independent

The paper's central chain formalizes three agent designs (DeclarativeAgent, ImperativeAgent, baseline) as distinct policy classes inside a Dec-POMDP, derives information-theoretic and structural properties from that formalization, and then tests the predicted differences on five models and two retrieval regimes. No equation or step reduces a claimed performance difference to a fitted parameter, a self-citation, or a quantity defined by the target result itself. The Dec-POMDP policy classes are presented as an external modeling choice whose consequences are then measured empirically; retrieval quality is treated as an independent variable. This structure is self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on the standard Dec-POMDP formalism to analyze agent properties and on the existence of distinct declarative versus imperative orchestration mechanisms; no free parameters, new entities, or ad-hoc axioms are introduced beyond these background concepts.

axioms (1)

standard math · Agents can be formalized as policy classes within a decentralised partially-observable Markov decision process.
Invoked to analyse information-theoretic and structural properties of the three agent types.
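For readers unfamiliar with the formalism, the standard Dec-POMDP tuple can be written as below. This is the usual textbook form in the spirit of Bernstein et al. (2002); the symbols are assumptions chosen here, and the paper's own notation and the way it instantiates each component for the three agents may differ.

```latex
% Standard n-agent Dec-POMDP tuple (textbook form; the paper's notation may differ)
\[
  \mathcal{M} = \langle \mathcal{I},\ \mathcal{S},\ \{\mathcal{A}_i\}_{i \in \mathcal{I}},\ P,\ R,\ \{\Omega_i\}_{i \in \mathcal{I}},\ O,\ T \rangle
\]
% I: set of agents; S: states; A_i: actions of agent i;
% P(s' | s, a): transition probability under joint action a = (a_1, ..., a_n);
% R(s, a): shared reward; Omega_i: observations of agent i;
% O(o | s', a): probability of joint observation o = (o_1, ..., o_n); T: horizon.
% A local policy maps agent i's observation history to an action:
\[
  \pi_i : \Omega_i^{*} \to \mathcal{A}_i
\]
```

A policy class is then a restriction on which maps $\pi_i$ are allowed, which is presumably how the three agent designs are distinguished.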

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
cs.AI · 2026-07 · unverdicted · novelty 6.0

SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.

Reference graph

Works this paper leans on

17 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres. τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge. Sierra Research / Princeton, arXiv:2603.04370, 2026. https://arxiv.org/abs/2603.04370

[2] Yao, S., Shinn, N., Razavi, P., Narasimhan, K. τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. ICLR, 2025.

[3] Sierra Research. τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Codebase). GitHub repository, 2025. https://github.com/sierra-research/tau2-bench

[4] Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S. The Complexity of Decentralized Control of Markov Decision Processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[5] Anthropic. Agent Skills: Composable, Model-Read Procedural Knowledge for LLM Agents. https://agentskills.io, 2025.

[6] Firecrawl. How SKILL.md Files Work and Why They're Everywhere. Firecrawl Blog, 2026. https://www.firecrawl.dev/blog/agent-skills

[7] LlamaIndex. Files for AI Agents: Context, Search, Skills Guide. LlamaIndex Blog, 2026. https://www.llamaindex.ai/blog/files-are-all-you-need

[8] Anonymous. Recursive Language Models. arXiv:2512.24601, 2025. https://arxiv.org/abs/2512.24601

[9] Zhang, A. Recursive Language Models. Blog post, 2025. https://alexzhang13.github.io/blog/2025/rlm/

[10] Chase, H. et al. LangGraph: Building Stateful, Multi-Actor Applications with LLMs. LangChain, 2024. https://langchain-ai.github.io/langgraph/

[11] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023.

[12] [Redacted for blind review]. Benchmarking Customer Support LLM Agents for Business-Adherence. EACL Industry Track, 2026. https://aclanthology.org/2026.eacl-industry.15.pdf

[13] Toloka AI. TAU-bench extension: benchmarking policy-aware agents in realistic settings. Toloka AI Blog, 2026. https://toloka.ai/blog/tau-bench-extension-benchmarking-policy-aware-agents-in-realistic-settings/

[14] Sierra Research. τ³-Bench: Advancing Agent Benchmarking to Knowledge and Voice. Sierra Blog, 2026. https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice

[15] Kahn, A.B. Topological sorting of large networks. Communications of the ACM, 5(11):558–562, 1962.

Appendix 9.1 Running the Experiments. A pilot run (5 tasks, 2 conditions) and full experiment (97 tasks, 4 conditions) are available via the project Makefile:

    # 5-task pilot
    make pilot

    # Full 97-task experiment
    make experiment

...

[16] Request credit limit increase

[17] File transaction dispute
END_TASKS

The parsed list is stored in state.pending_tasks and injected into the EXECUTION phase instruction as a <task_queue> hint:

    queue_hint = (
        "\n<task_queue>\n"
        f"Pending: {', '.join(f'{i+1}. {t}' for i, t in enumerate(state.pending_tasks))}\...
