The Twelfth International Conference on Learning Representations
2 Pith papers cite this work.
Fields: cs.SE (2)
Years: 2026 (2)
Citing papers
- ContractBench: Can LLM Agents Preserve Observation Contracts?
  ContractBench shows that LLM agents frequently violate observation contracts by using expired artifacts or corrupting their byte integrity, with no model exceeding 80% success and notable scaling irregularities across families.
- An Executable Benchmarking Suite for Tool-Using Agents
  The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.