ContractBench: Can LLM Agents Preserve Observation Contracts?

arXiv: 2605.17281 · v1 · pith:IYJKHMX7 · submitted 2026-05-17 · cs.SE · cs.AI


Jicheng Wang, Yifeng He, Zili Wang, Hanwen Xing, Arkaprava De, Hao Chen

Pith reviewed 2026-05-19 22:56 UTC · model grok-4.3


keywords LLM agents · observation contracts · tool use · API compliance · benchmark · validity failures · integrity failures · scaling behavior


The pith

LLM agents must preserve observation contracts like tokens and presigned URLs, yet current models routinely fail at this separate capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ContractBench to test whether tool-augmented LLM agents keep observation contracts intact, meaning they respect time limits and exact byte content of API outputs such as session tokens or OAuth parameters. It evaluates 38 models across 33 dual-axis tasks that separately check for validity failures (using expired artifacts) and integrity failures (corrupting bytes during processing). No model reaches 80 percent success, with the highest score at 77.8 percent, and the results show that general tool-use skill does not ensure contract compliance. Certain scaling patterns and post-training steps can even reduce performance, while labeling failures from real API rules can serve as an in-context signal to improve results.

Core claim

Observation contract compliance is an emergent, regression-prone capability that is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. ContractBench measures this through 33 tasks that probe validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes), with deterministic evaluation via a virtual clock and SHA-256 hashes. Across 38 models, none exceed 80 percent accuracy and some families show sharp cliffs or non-monotonic drops tied to agentic post-training.

What carries the argument

ContractBench, a benchmark of 33 dual-axis tasks that separately test temporal validity and byte-level integrity of observation contracts using a virtual clock and hash verification.
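The two checks can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's harness: `VirtualClock`, `Contract`, `issue`, `check`, and the outcome label strings are all hypothetical names chosen here. The sketch assumes the validity rule is "use time is no later than issue time plus a time-to-live" and the integrity rule is "SHA-256 of the bytes used equals SHA-256 of the bytes returned by the tool".

```python
import hashlib
from dataclasses import dataclass


class VirtualClock:
    """Deterministic clock: time moves only when the harness advances it."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        if seconds < 0:
            raise ValueError("time cannot move backwards")
        self._now += seconds


@dataclass(frozen=True)
class Contract:
    expected_sha256: str  # hex digest of the artifact as the tool returned it
    issued_at: float      # issue time on the virtual clock
    ttl: float            # lifetime in seconds (assumed form of the expiry rule)


def issue(artifact: bytes, clock: VirtualClock, ttl: float) -> Contract:
    """Record the contract at the moment the tool returns the artifact."""
    return Contract(hashlib.sha256(artifact).hexdigest(), clock.now(), ttl)


def check(contract: Contract, used: bytes, clock: VirtualClock) -> str:
    """Classify one use of the artifact; label names are illustrative only."""
    intact = hashlib.sha256(used).hexdigest() == contract.expected_sha256
    valid = clock.now() <= contract.issued_at + contract.ttl
    if intact and valid:
        return "pass"
    if intact:
        return "validity_failure"
    if valid:
        return "integrity_failure"
    return "validity_and_integrity_failure"


clock = VirtualClock()
url = b"https://example.invalid/object?sig=abc%2Fdef"  # made-up presigned URL
contract = issue(url, clock, ttl=60.0)

clock.advance(30.0)
in_time = check(contract, url, clock)
decoded = check(contract, url.replace(b"%2F", b"/"), clock)  # agent "helpfully" decoded the URL

clock.advance(31.0)
too_late = check(contract, url, clock)
```

Because both predicates are pure functions of the clock reading and the bytes, repeated runs on the same trace give the same label, which is the property the deterministic evaluation relies on.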

If this is right

General tool-use training is insufficient and targeted restraint training is needed to avoid using artifacts after expiry or altering their content.
Within-model-family scaling can produce sudden jumps in compliance once a size threshold allows mid-trajectory checking rather than immediate action.
Agentic post-training that emphasizes helpfulness can introduce sycophancy-driven regressions that lower contract adherence.
Feeding the failure taxonomy back as an in-context reward signal yields +7.1 pp on 42 paired GPT-5.1 failures.
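As a rough illustration of the last point, a failure label can be turned into a short message that is appended to the agent's context before a retry. The label names, the hint wording, and the `feedback_message` helper below are assumptions made for this sketch; the paper's actual taxonomy and feedback format are not reproduced here.

```python
# Hypothetical label names and hint wording, for illustration only.
HINTS = {
    "validity_failure": (
        "The artifact had expired when it was used. "
        "Request a fresh one and use it before its deadline."
    ),
    "integrity_failure": (
        "The artifact's bytes changed between observation and use. "
        "Pass the value through exactly as the tool returned it."
    ),
}


def feedback_message(label: str, task_id: str) -> str:
    """Build the text appended to the agent's context before a retry."""
    hint = HINTS.get(label)
    if hint is None:
        raise KeyError(f"no hint registered for label: {label}")
    return f"[{task_id}] Previous attempt failed with {label}. {hint}"


message = feedback_message("validity_failure", "presigned-url-download")
```

The design choice in this sketch is that the signal names the violated contract axis rather than only reporting pass or fail, so the retry can target the specific behavior that broke the contract.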

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real deployments may need explicit contract-tracking layers outside the model to catch validity and integrity errors the LLM misses.
The observed within-family cliffs suggest that future scaling studies should measure contract compliance as a distinct axis alongside tool competence.
Extending the benchmark to multi-step agent trajectories with chained contracts could reveal whether current failures compound over longer sessions.
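The first extension above, a contract-tracking layer outside the model, could look roughly like the following. This is one possible design under stated assumptions, not something the paper proposes: the `ContractLedger` class, its method names, and the idea of having the model refer to artifacts by handle are all hypothetical.

```python
from dataclasses import dataclass


class ContractError(Exception):
    """Raised when an artifact can no longer be used safely."""


@dataclass(frozen=True)
class Tracked:
    value: bytes       # exact bytes returned by the tool, never retyped by the model
    expires_at: float  # absolute expiry time on whatever clock the caller uses


class ContractLedger:
    """Stores tool outputs by handle and releases them only while they are valid."""

    def __init__(self) -> None:
        self._items: dict[str, Tracked] = {}

    def register(self, handle: str, value: bytes, expires_at: float) -> None:
        self._items[handle] = Tracked(value, expires_at)

    def resolve(self, handle: str, now: float) -> bytes:
        item = self._items.get(handle)
        if item is None:
            raise ContractError(f"unknown handle: {handle}")
        if now > item.expires_at:
            raise ContractError(f"{handle} expired; refresh it before use")
        return item.value


ledger = ContractLedger()
ledger.register("upload_url", b"https://example.invalid/u?sig=a%2Fb", expires_at=100.0)
fresh = ledger.resolve("upload_url", now=50.0)
```

In this arrangement the model emits only the handle (`"upload_url"`), so byte-level integrity is preserved by construction and expiry is enforced at the point of use rather than left to the model's judgment.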

Load-bearing premise

The 33 dual-axis tasks and their failure labels drawn from real-world API specifications sufficiently capture the observation-contract compliance problem that arises in deployed tool-augmented agents.

What would settle it

A production agent that scores above 80 percent on ContractBench yet frequently uses expired presigned URLs or corrupted tokens in live API calls would show the benchmark does not track the real capability.

Figures

Figures reproduced from arXiv: 2605.17281 by Arkaprava De, Hanwen Xing, Hao Chen, Jicheng Wang, Yifeng He, Zili Wang.

**Figure 1.** ContractBench is the first deterministic benchmark for LLM-agent observation-contract compliance, spanning 33 tasks across two orthogonal axes (temporal validity and byte-level integrity) and eight real-world API contract domains.

**Figure 2.** Two orthogonal predicates define an observation contract. Integrity (SHA-256(payload) = expected) and validity ($t_{\text{fetch}} \le t_{\text{issue}} + \tau$) must both hold (Definition 2.2); the four quadrants instantiate Proposition 1. Quadrant area encodes ContractBench's task distribution.

**Figure 3.** Within-family scaling on ContractBench: at fixed parameter count, the Base→Instruct gap (post-training delta) is family-specific, with Qwen 3.5 reaching 56.6% already at 9B-Instruct.

**Figure 4.** A V-shape performance within the GPT-5 family with shared pretrained base: a structured, …

**Figure 5.** Figure 5 reproduces the four files that make up a single ContractBench task, using harbor/tasks/presigned-url-download/ as a representative example. The prose summary is in …

**Figure 6.** … The variance is not measurement noise: in every variable cell, the variance comes from genuine stochasticity in the agent's strategy under tight TTL or rate-limit pressure, not from validator non-determinism (the virtual clock + SHA-256 trace hash hold all else fixed).

**Figure 7.** Per-task pass-rate heatmap: frontier proprietary models × 33 tasks. The Hybrid block reveals the universally-hard tasks (e.g., multi-turn-recall is red across every row).

**Figure 8.** Per-task pass-rate heatmap: open-source SOTA × 33 tasks. Same color scale and family grouping as Appendix …

**Figure 9.** Capability ladder of failure-mode profiles, ordered by SR. The dominant failure type shifts …

**Figure 10.** Full per-model failure-label profile (stacked bars, all 15 complete- or near-complete …

Original abstract

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ContractBench, a benchmark of 33 dual-axis tasks that evaluate LLM agents on preserving observation contracts (artifacts such as presigned URLs, tokens, and OAuth parameters whose later use is constrained by temporal validity and byte integrity). Using deterministic evaluation with a virtual clock and SHA-256 checks, the authors test 38 models and report that no model exceeds 80% compliance (Claude-Opus-4.6 leads at 77.8%), identify within-family scaling cliffs (e.g., Qwen 3.5) and non-monotonic behavior (GPT-5 family), and show that their failure taxonomy can serve as an in-context reward yielding +7.1 pp improvement.

Significance. If the 33 tasks validly isolate observation-contract compliance from general agent robustness issues, the work would usefully document a gap in current frontier models for realistic tool-augmented deployment. The deterministic, programmatic evaluation protocol (virtual clock plus hash verification) and the demonstration that the taxonomy functions as an actionable reward signal are concrete strengths that support reproducibility and potential follow-on work.

major comments (3)

[§3] §3 (Benchmark Construction): The claim that the 33 tasks cleanly probe validity and integrity failures rests on the assertion that failure labels are drawn from real-world API specifications, yet the manuscript provides no coverage analysis, inter-annotator validation, or explicit check that the dual-axis design separates contract-specific errors from confounding factors such as multi-turn state tracking or general planning failures.
[§4.2] §4.2 (Prompting and Contract Conveyance): The evaluation protocol is described as deterministic, but the paper does not report the exact prompting templates used to communicate contract constraints (expiry times, integrity requirements) to the agents; without this, it is unclear whether observed failures reflect inability to preserve contracts or simply failure to receive or parse the constraints in the first place.
[§5.3] §5.3 (Non-monotonic Scaling Claim): The attribution of GPT-5 family regression to sycophancy-driven erosion of compliance is load-bearing for the 'regression-prone' characterization, yet the manuscript supplies only aggregate scores rather than paired before/after examples or ablation showing that post-training specifically increases contract violations while preserving other tool-use metrics.

minor comments (2)

[Figures/Tables] Figure 2 and Table 4: axis labels and legend entries use inconsistent abbreviations for model families; expanding them would improve readability.
[§2] §2 (Related Work): The discussion of prior agent benchmarks could more explicitly contrast ContractBench's focus on observation contracts with existing tool-use suites that emphasize planning or API selection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

Point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The claim that the 33 tasks cleanly probe validity and integrity failures rests on the assertion that failure labels are drawn from real-world API specifications, yet the manuscript provides no coverage analysis, inter-annotator validation, or explicit check that the dual-axis design separates contract-specific errors from confounding factors such as multi-turn state tracking or general planning failures.

Authors: We acknowledge the value of additional documentation on task construction. The 33 tasks were derived by enumerating common observation-contract patterns from public API specifications (e.g., AWS S3 presigned URLs, OAuth 2.0 state parameters, and time-limited session tokens). In the revision we will add a dedicated subsection to §3 that (i) lists the specific API sources and the coverage they provide across validity and integrity axes, (ii) describes the programmatic generation process that enforces the dual-axis separation, and (iii) reports results on a set of control tasks without contract constraints to demonstrate that general planning or state-tracking failures do not dominate the measured outcomes. Because the failure labels are assigned by direct reference to the cited specifications rather than by human annotation, inter-annotator agreement statistics are not applicable; we will explicitly note this design choice.

Revision: yes

Referee: [§4.2] §4.2 (Prompting and Contract Conveyance): The evaluation protocol is described as deterministic, but the paper does not report the exact prompting templates used to communicate contract constraints (expiry times, integrity requirements) to the agents; without this, it is unclear whether observed failures reflect inability to preserve contracts or simply failure to receive or parse the constraints in the first place.

Authors: We agree that the precise wording used to convey constraints is necessary for full reproducibility. Although the downstream evaluation (virtual clock and SHA-256 verification) is deterministic, the prompt-level communication of expiry and integrity rules can influence agent behavior. In the revised manuscript we will include the complete prompting templates in an appendix, showing the exact phrasing for temporal constraints (“use this artifact before timestamp T”) and integrity constraints (“do not alter any byte of the provided value”). This addition will allow readers to confirm that the constraints were explicitly and consistently stated.

Revision: yes

Referee: [§5.3] §5.3 (Non-monotonic Scaling Claim): The attribution of GPT-5 family regression to sycophancy-driven erosion of compliance is load-bearing for the 'regression-prone' characterization, yet the manuscript supplies only aggregate scores rather than paired before/after examples or ablation showing that post-training specifically increases contract violations while preserving other tool-use metrics.

Authors: The non-monotonic pattern is directly visible in the per-variant aggregate scores we report. We interpret the regression as consistent with sycophancy effects documented in the post-training literature for agentic models. While we cannot release proprietary intermediate checkpoints for a controlled ablation, we will add representative failure traces from the GPT-5 family evaluations to §5.3, illustrating the specific contract violations that increase after the post-training stage. These qualitative examples, together with the quantitative drop, provide concrete support for the regression-prone characterization.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

full rationale

This paper is a pure empirical benchmark study that constructs 33 tasks from real-world API specifications and measures model compliance via deterministic programmatic checks (virtual clock for validity, SHA-256 for integrity). The reported scores and four findings are obtained by direct execution on 38 models; no equations, fitted parameters, self-referential predictions, or derivations are present that could reduce to the inputs by construction. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes, rendering the evaluation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark without introducing new free parameters, axioms, or postulated entities; it relies on standard practices in LLM evaluation and real-world API specifications.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: observation contracts... temporal validity and byte-level integrity... virtual clock... SHA-256 hashes

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: failure taxonomy... validity failures... integrity failures

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

