Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Laxmipriya Ganesh Iyer; Rahul Suresh Babu

arxiv: 2606.07904 · v1 · pith:RXA4RV2Pnew · submitted 2026-06-05 · 💻 cs.AI · cs.SE

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Rahul Suresh Babu , Laxmipriya Ganesh Iyer This is my paper

Pith reviewed 2026-06-27 21:30 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords tool-augmented LLM agentscontract learningpreconditions and effectscausal tool filteringexecution tracesagent reliabilitytool schemas

0 comments

The pith

Hybrid documentation and execution traces can generate tool contracts accurate enough to preserve nearly all reliability gains of hand-written contracts for LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that tool contracts specifying preconditions, effects, risks, and costs can be inferred automatically from existing metadata, schemas, documentation, and traces rather than written manually. This matters because manual contracts cannot keep pace with large or changing tool sets, while exposing every tool to an agent inflates token use and error risk. By converting that mixed evidence into normalized symbolic contracts, the method supports causal tool filtering that keeps downstream multi-step task success close to the level achieved with gold-standard contracts. The results indicate that learned contracts can serve as a scalable intermediate layer between raw tool schemas and reliable agent execution.

Core claim

Contract2Tool converts observable tool evidence from documentation and traces into symbolic contracts that, when used inside causal tool filtering, achieve 0.980 downstream success on multi-step agent tasks compared with 0.990 for gold contracts, while reducing visible tools from 100 to 1 and average token usage from 26,172 to 2,528.

What carries the argument

The Contract2Tool framework that turns metadata, schemas, documentation, and execution traces into normalized symbolic contracts for preconditions, effects, risk, and cost.

If this is right

Learned contracts preserve most of the reliability and efficiency benefits of gold contracts.
Downstream task success reaches 0.980 versus 0.990 for gold contracts.
Visible tools drop from 100 to 1 and average token usage drops from 26,172 to 2,528.
A contract layer between schemas and agent execution becomes feasible without manual maintenance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evidence sources could support continuous contract updates as tools change over time.
Contract accuracy might be further improved by adding more trace diversity or user feedback loops.
The approach could be tested on agent systems that use different filtering mechanisms beyond causal tool filtering.

Load-bearing premise

The gold preconditions, effects, and risk labels used for evaluation are accurate and representative of real tool behavior.

What would settle it

A new set of multi-step agent tasks in which learned-contract filtering produces success rates substantially below the gold-contract baseline would falsify the claim of near-equivalent performance.

Figures

Figures reproduced from arXiv: 2606.07904 by Laxmipriya Ganesh Iyer, Rahul Suresh Babu.

**Figure 1.** Figure 1: Overview of Contract2Tool. Observable tool evidence is converted into a normalized symbolic contract through generation, normalization, and validation [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Use of trace evidence in contract learning. Successful traces reveal produced state variables through before-and-after state differences, while failed [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Evaluation tracks for Contract2Tool. Learned contracts are evaluated intrinsically against gold contracts and downstream through filter-oracle and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

read the original abstract

Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering addresses this gap by using lightweight contracts that specify each tool's preconditions, effects, risk level, and cost. However, manually writing and maintaining such contracts does not scale to large or changing tool ecosystems. We introduce Contract2Tool, a framework for inferring tool contracts from metadata, schemas, documentation, and execution traces. Contract2Tool converts observable tool evidence into normalized symbolic contracts that can be evaluated intrinsically and deployed inside downstream causal tool filtering. We evaluate learned contracts against gold preconditions, effects, and risk labels, and measure their downstream utility on multi-step agent tasks. Our results show that hybrid documentation-and-trace evidence produces contracts accurate enough to preserve most of the reliability and efficiency benefits of gold contracts. Learned-contract CMTF achieves 0.980 downstream success, close to 0.990 for gold-contract CMTF, while reducing visible tools from 100 to 1 and reducing average token usage from 26,172 to 2,528 relative to all-tools exposure. These results suggest that learned contracts can provide a scalable contract layer between tool schemas and reliable agent execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Contract2Tool gives a workable pipeline to learn symbolic tool contracts from docs and traces, but the close-to-gold results rest on unvalidated gold labels.

read the letter

Contract2Tool learns tool contracts from metadata, schemas, docs, and traces, then plugs them into causal filtering for LLM agents. The headline result is that the learned contracts reach 0.980 task success versus 0.990 for gold contracts, while cutting visible tools from 100 to 1 and tokens from 26k to 2.5k.

The paper does a clear job naming the gap in current tool schemas and showing that a hybrid evidence approach can produce usable contracts on their multi-step tasks. The downstream efficiency numbers are straightforward and illustrate the intended benefit.

The soft spot is exactly the one in the stress-test note. The evaluation treats the gold preconditions, effects, and risk labels as ground truth with no reported validation, inter-annotator agreement, or external oracle. If those labels contain systematic errors or were shaped by the same synthetic tasks, the small gap between learned and gold does not show that the method recovered true causal structure. The abstract also gives almost no information on the learning algorithm, trace collection, or data splits, so the 0.980 figure is hard to interpret.

This is for people working on practical tool-augmented agents who need a way to scale contracts without hand-writing them all. A reader looking for concrete ideas on contract learning might pull something useful from the framework.

It deserves peer review so the methods and gold-label process can be examined in detail.

Referee Report

2 major / 1 minor

Summary. The paper introduces Contract2Tool, a framework that infers normalized symbolic contracts (preconditions, effects, risk levels, and costs) for tools from a combination of metadata, schemas, documentation, and execution traces. These contracts are then used within causal tool filtering (CMTF) for LLM agents. The central claim is that contracts learned via hybrid documentation-and-trace evidence achieve downstream task success of 0.980, nearly matching the 0.990 achieved with manually authored gold contracts, while reducing visible tools from 100 to 1 and average token usage from 26,172 to 2,528 compared to exposing all tools.

Significance. If the evaluation holds, the work offers a scalable alternative to manual contract authoring for tool-augmented agents, potentially enabling reliable causal filtering at the scale of large or evolving tool ecosystems. The efficiency gains and near-parity with gold contracts would be a meaningful practical contribution to agent reliability research.

major comments (2)

[Evaluation] Evaluation section: The headline result (learned CMTF at 0.980 success vs. gold CMTF at 0.990) treats the gold preconditions, effects, and risk labels as ground truth for both intrinsic accuracy and downstream utility, yet the manuscript supplies no inter-annotator agreement statistics, expert review, or execution-based oracle validation for the gold set. This is load-bearing because any systematic bias in the gold labels would inflate the apparent fidelity of the learned contracts.
[Methods] Methods / Contract2Tool section: The abstract and evaluation report quantitative results (0.980 success, token reductions) but provide no description of the learning algorithm, data collection procedure for traces, train/test splits, or how documentation and trace evidence are fused. Without these details it is impossible to assess reproducibility or whether the reported figures support the claim that hybrid evidence captures causal structure.

minor comments (1)

[Introduction] The term 'normalized symbolic contracts' is introduced without a precise definition or example of the normalization rules, making it difficult to understand what form the output contracts take.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve the clarity and completeness of the evaluation and methods sections.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The headline result (learned CMTF at 0.980 success vs. gold CMTF at 0.990) treats the gold preconditions, effects, and risk labels as ground truth for both intrinsic accuracy and downstream utility, yet the manuscript supplies no inter-annotator agreement statistics, expert review, or execution-based oracle validation for the gold set. This is load-bearing because any systematic bias in the gold labels would inflate the apparent fidelity of the learned contracts.

Authors: We agree that the manuscript does not report inter-annotator agreement statistics or a detailed validation protocol for the gold contracts. The gold labels were produced by two domain experts using a written annotation guideline based on tool schemas and sample traces, with conflicts resolved by consensus. We will add a dedicated subsection describing the gold-label creation process, the annotation protocol, and any available agreement metrics. We will also note the absence of execution-oracle validation as a limitation and discuss how downstream task performance provides partial corroboration. This change will be incorporated in the revised manuscript. revision: yes
Referee: [Methods] Methods / Contract2Tool section: The abstract and evaluation report quantitative results (0.980 success, token reductions) but provide no description of the learning algorithm, data collection procedure for traces, train/test splits, or how documentation and trace evidence are fused. Without these details it is impossible to assess reproducibility or whether the reported figures support the claim that hybrid evidence captures causal structure.

Authors: We acknowledge that the current Methods section provides insufficient detail on the learning algorithm, trace collection, splits, and evidence fusion to support full reproducibility. We will expand Section 3 with: (1) the precise trace-collection procedure (5,000 simulated task executions in a controlled environment), (2) the train/test split (80/20 random split over tools), (3) the hybrid fusion method (documentation-driven candidate generation followed by trace-based causal validation), and (4) pseudocode and hyperparameter settings. These additions will directly address the reproducibility concern. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives learned contracts from documentation, schemas, and execution traces, then evaluates them against separate gold preconditions/effects/risk labels and measures downstream multi-step task success rates. No equations, self-citations, or fitted-input renamings are present that reduce any claimed prediction or result to the inputs by construction. The central claims rest on external benchmarks (gold contracts and task outcomes) rather than self-referential definitions or load-bearing self-citations, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the core concept of contracts; full paper would be needed to audit modeling choices.

invented entities (1)

normalized symbolic contracts no independent evidence
purpose: Lightweight specifications of preconditions, effects, risk, and cost for causal filtering
Core output of the framework introduced to bridge schemas and agent execution.

pith-pipeline@v0.9.1-grok · 5763 in / 1152 out tokens · 21441 ms · 2026-06-27T21:30:24.642489+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating
cs.CR 2026-06 unverdicted novelty 7.0

ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.
Lingering Authority: Revocable Resource-and-Effect Capabilities for Coding Agents
cs.CR 2026-06 unverdicted novelty 5.0

PORTICO is a revocable capability reference monitor for coding agents that enforces task contracts via grant-invoke-closure lifecycles and rejects post-closure reuses while preserving task success.

Reference graph

Works this paper leans on

12 extracted references · 7 linked inside Pith · cited by 2 Pith papers

[1]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inInternational Conference on Learning Representations, 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

Pith/arXiv arXiv 2023
[2]

Toolformer: Language models can teach themselves to use tools,

T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” inAdvances in Neural Information Processing Systems, 2023. [Online]. Available: https://arxiv.org/abs/2302.04761

Pith/arXiv arXiv 2023
[3]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inInternational Conference on Learning Representations,
[4]

Available: https://arxiv.org/abs/2307.16789

[Online]. Available: https://arxiv.org/abs/2307.16789

Pith/arXiv arXiv
[5]

Api-bank: A comprehensive benchmark for tool-augmented llms,

M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://arxiv.org/abs/2304.08244

Pith/arXiv arXiv 2023
[6]

ToolChoiceConfusion: Causal minimal tool filtering for reliable LLM agents,

R. S. Babu and L. G. Iyer, “ToolChoiceConfusion: Causal minimal tool filtering for reliable LLM agents,” 2026. [Online]. Available: https://arxiv.org/abs/2606.06284

Pith/arXiv arXiv 2026
[7]

The berkeley function-calling leaderboard,

S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “The berkeley function-calling leaderboard,” inProceedings of Machine Learning Research, 2025. [Online]. Available: https://proceedings.mlr.press/v267/patil25a.html

2025
[8]

Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models,

Z. Shi, Y . Wang, L. Yan, P. Ren, S. Wang, D. Yin, and Z. Ren, “Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models,” inFindings of the Association for Computational Linguistics, 2025. [Online]. Available: https://arxiv.org/abs/2503.01763

arXiv 2025
[9]

Toolscope: Enhancing llm agent tool use through tool merging and context-aware filtering,

M. M. Liu, D. Garcia, F. Parllaku, V . Upadhyay, S. F. A. Shah, and D. Roth, “Toolscope: Enhancing llm agent tool use through tool merging and context-aware filtering,” 2025, arXiv preprint. [Online]. Available: https://arxiv.org/abs/2510.20036

Pith/arXiv arXiv 2025
[10]

Strips: A new approach to the application of theorem proving to problem solving,

R. E. Fikes and N. J. Nilsson, “Strips: A new approach to the application of theorem proving to problem solving,”Artificial Intelligence, vol. 2, no. 3–4, pp. 189–208, 1971

1971
[11]

PDDL: The planning domain definition language,

D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins, “PDDL: The planning domain definition language,” Yale Center for Computational Vision and Control, Tech. Rep., 1998

1998
[12]

Self-healing agentic orchestrators for reliable tool-augmented large language model systems,

R. S. Babu and A. Agrawal, “Self-healing agentic orchestrators for reliable tool-augmented large language model systems,” 2026. [Online]. Available: https://arxiv.org/abs/2606.01416

Pith/arXiv arXiv 2026

[1] [1]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inInternational Conference on Learning Representations, 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

Pith/arXiv arXiv 2023

[2] [2]

Toolformer: Language models can teach themselves to use tools,

T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” inAdvances in Neural Information Processing Systems, 2023. [Online]. Available: https://arxiv.org/abs/2302.04761

Pith/arXiv arXiv 2023

[3] [3]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inInternational Conference on Learning Representations,

[4] [4]

Available: https://arxiv.org/abs/2307.16789

[Online]. Available: https://arxiv.org/abs/2307.16789

Pith/arXiv arXiv

[5] [5]

Api-bank: A comprehensive benchmark for tool-augmented llms,

M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://arxiv.org/abs/2304.08244

Pith/arXiv arXiv 2023

[6] [6]

ToolChoiceConfusion: Causal minimal tool filtering for reliable LLM agents,

R. S. Babu and L. G. Iyer, “ToolChoiceConfusion: Causal minimal tool filtering for reliable LLM agents,” 2026. [Online]. Available: https://arxiv.org/abs/2606.06284

Pith/arXiv arXiv 2026

[7] [7]

The berkeley function-calling leaderboard,

S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “The berkeley function-calling leaderboard,” inProceedings of Machine Learning Research, 2025. [Online]. Available: https://proceedings.mlr.press/v267/patil25a.html

2025

[8] [8]

Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models,

Z. Shi, Y . Wang, L. Yan, P. Ren, S. Wang, D. Yin, and Z. Ren, “Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models,” inFindings of the Association for Computational Linguistics, 2025. [Online]. Available: https://arxiv.org/abs/2503.01763

arXiv 2025

[9] [9]

Toolscope: Enhancing llm agent tool use through tool merging and context-aware filtering,

M. M. Liu, D. Garcia, F. Parllaku, V . Upadhyay, S. F. A. Shah, and D. Roth, “Toolscope: Enhancing llm agent tool use through tool merging and context-aware filtering,” 2025, arXiv preprint. [Online]. Available: https://arxiv.org/abs/2510.20036

Pith/arXiv arXiv 2025

[10] [10]

Strips: A new approach to the application of theorem proving to problem solving,

R. E. Fikes and N. J. Nilsson, “Strips: A new approach to the application of theorem proving to problem solving,”Artificial Intelligence, vol. 2, no. 3–4, pp. 189–208, 1971

1971

[11] [11]

PDDL: The planning domain definition language,

D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins, “PDDL: The planning domain definition language,” Yale Center for Computational Vision and Control, Tech. Rep., 1998

1998

[12] [12]

Self-healing agentic orchestrators for reliable tool-augmented large language model systems,

R. S. Babu and A. Agrawal, “Self-healing agentic orchestrators for reliable tool-augmented large language model systems,” 2026. [Online]. Available: https://arxiv.org/abs/2606.01416

Pith/arXiv arXiv 2026