pith. machine review for the scientific record.

arxiv: 2406.12045 · v1 · submitted 2024-06-17 · 💻 cs.AI · cs.CL

Recognition: no theorem link

τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Karthik Narasimhan, Noah Shinn, Pedram Razavi, Shunyu Yao

Pith reviewed 2026-05-11 03:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: language agents · benchmarks · tool use · user interaction · policy following · reliability · function calling · multi-turn evaluation

The pith

Even top language agents succeed on under half of tasks that require ongoing user interaction and strict policy adherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces τ-bench to test language agents in realistic multi-turn conversations where a simulated user pursues a goal through an agent equipped with domain-specific tools and rules. Evaluation checks whether the final database state matches an annotated target state rather than inspecting individual actions. Experiments with current function-calling models, including GPT-4o, produce success rates below 50 percent and pass^8 scores under 25 percent in the retail domain. These outcomes indicate that agents still lack the consistency and rule-following needed for reliable deployment.
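
As a concrete illustration of state-based grading, here is a minimal, hedged sketch: the table names, task structure, and helper function are hypothetical, not the benchmark's actual schemas or tool APIs.

```python
# Hedged sketch of tau-bench-style grading: an episode passes only if the
# ending database state matches the annotated goal state; the conversation
# transcript itself is never inspected. Table names are illustrative only.

def grade_episode(final_db: dict, goal_db: dict, tracked_tables=("orders", "users")) -> bool:
    """Return True iff every tracked table matches the annotated goal exactly."""
    return all(final_db.get(table) == goal_db.get(table) for table in tracked_tables)

# Example: an agent that never cancelled the order fails, however fluent the dialogue.
goal = {"orders": {"#W1": {"status": "cancelled"}}, "users": {}}
actual = {"orders": {"#W1": {"status": "delivered"}}, "users": {}}
assert grade_episode(actual, goal) is False
```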

Core claim

τ-bench emulates conversations between a language-model-simulated user and an agent supplied with API tools and policy guidelines in retail and airline domains. The benchmark scores agents by comparing the ending database state to a goal state and introduces the pass^k metric to measure reliability across repeated trials. Results show that even state-of-the-art agents succeed on fewer than half the tasks and remain inconsistent, pointing to the need for new methods that improve consistency and rule adherence.

What carries the argument

τ-bench, which runs dynamic user-agent conversations and judges success by matching the final database state to an annotated goal, together with the pass^k metric, which measures how often an agent succeeds on a task in all k of k independent trials.
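
As a hedged sketch of how that reliability number can be estimated (this follows standard pass@k-style combinatorics; the paper's exact estimator is not reproduced here): with $n$ i.i.d. trials per task, $c_t$ successes on task $t$, and task set $T$,

$$\widehat{\mathrm{pass}^k} \;=\; \frac{1}{|T|} \sum_{t \in T} \frac{\binom{c_t}{k}}{\binom{n}{k}}, \qquad k \le n,$$

so pass^1 reduces to the ordinary success rate, and pass^k falls as k grows whenever success on a task is not reproducible across trials.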

Load-bearing premise

Language-model-simulated users generate conversations and goals that are faithful to real human users, and the chosen domains and policies are representative of actual deployment settings.

What would settle it

Running the same tasks with real human users in place of the LM simulators: if success rates rose above 50 percent, or pass^8 rose above 25 percent in retail, the reported performance gap would be attributable to the simulation rather than to the agents themselves.

read the original abstract

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces τ-bench, a benchmark for tool-using language agents that interact dynamically with simulated users in real-world domains (retail and airline). Agents receive domain-specific API tools and policy guidelines; users are simulated by LMs. Success is measured by comparing the final database state against independently annotated goal states rather than conversation content, and a new pass^k metric quantifies reliability across repeated trials. Experiments report that even GPT-4o succeeds on fewer than 50% of tasks and achieves pass^8 below 25% in retail, indicating that current agents struggle with consistency and rule adherence.

Significance. If the LM-simulated users produce behavior distributionally close to that of real humans under the same policies, the benchmark would be a valuable addition to the field, filling gaps in existing evaluations that ignore user interaction and domain rules. The state-comparison evaluation protocol and pass^k metric are sound, parameter-free ideas that enable reproducible assessment without subjective judgment of dialogue quality. The work explicitly acknowledges the need for methods that improve agent consistency.

major comments (2)
  1. [§4] §4 (User Simulation): The user prompts and task selection process are described, but no human baseline, inter-annotator agreement with humans, or ablation replacing the LM user with a scripted oracle is reported. This is load-bearing for interpreting the headline result (GPT-4o <50% success, pass^8 <25% in retail) as evidence of real-world limitations rather than simulation artifacts.
  2. [§5.2] §5.2 (Experiments, retail domain): The reported pass^8 <25% relies on the assumption that the 8 trials capture the full distribution of agent behavior; without variance estimates or justification for k=8, the inconsistency claim is difficult to interpret quantitatively.
minor comments (2)
  1. [Abstract] The abstract states that tasks are 'annotated' but does not specify who performs the annotation or how goal states are constructed; this detail belongs in §3 or §4.
  2. [Figure 1] Figure 1 (overview) would benefit from an explicit legend distinguishing the agent, user simulator, and database components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and note planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§4] §4 (User Simulation): The user prompts and task selection process are described, but no human baseline, inter-annotator agreement with humans, or ablation replacing the LM user with a scripted oracle is reported. This is load-bearing for interpreting the headline result (GPT-4o <50% success, pass^8 <25% in retail) as evidence of real-world limitations rather than simulation artifacts.

    Authors: We agree that the lack of direct human validation for the LM-simulated users limits how strongly the results can be interpreted as evidence of real-world agent limitations. We chose LM users to create a scalable, reproducible testbed with fixed policy guidelines that real humans would be unlikely to follow consistently across trials. The evaluation protocol relies on objective comparison of the final database state rather than subjective dialogue quality, which reduces some sources of simulation bias. In the revised manuscript we will expand Section 4 with an explicit limitations paragraph acknowledging the absence of human baselines or inter-annotator agreement, and we will add an ablation study that replaces the LM user with a simple rule-based scripted oracle on a subset of tasks to verify policy adherence. revision: partial

  2. Referee: [§5.2] §5.2 (Experiments, retail domain): The reported pass^8 <25% relies on the assumption that the 8 trials capture the full distribution of agent behavior; without variance estimates or justification for k=8, the inconsistency claim is difficult to interpret quantitatively.

    Authors: We selected k=8 to demonstrate the reliability drop while remaining computationally tractable, since each trial consists of a full multi-turn conversation. To strengthen the quantitative claim, the revised version will report variance or standard deviation of success rates across the eight trials and will include pass^k values for a range of k (e.g., 1–16) so readers can observe the trend. These additions will make the inconsistency results easier to interpret without altering the headline numbers. revision: yes
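
A minimal sketch of the kind of reporting described above, assuming per-task trial outcomes are available as a trials-by-tasks boolean array; the array shape, names, and the combinatorial pass^k estimate are assumptions, not the paper's released code.

```python
# Hedged sketch: summarize agent reliability from repeated trials by reporting
# the mean and standard deviation of per-trial success rates plus a pass^k
# curve for k = 1..n_trials, estimated combinatorially from success counts.
from math import comb

import numpy as np


def reliability_report(outcomes: np.ndarray) -> dict:
    """outcomes: boolean array of shape (n_trials, n_tasks), True = task solved."""
    n_trials, _ = outcomes.shape
    per_trial_rates = outcomes.mean(axis=1)        # success rate of each trial
    successes_per_task = outcomes.sum(axis=0)      # c_t for every task
    pass_k = {
        k: float(np.mean([comb(int(c), k) / comb(n_trials, k) for c in successes_per_task]))
        for k in range(1, n_trials + 1)
    }
    return {
        "mean_rate": float(per_trial_rates.mean()),
        "std_rate": float(per_trial_rates.std(ddof=1)),
        "pass^k": pass_k,
    }


# Example with 8 trials over 4 tasks: inconsistent tasks drag pass^8 well below pass^1.
rng = np.random.default_rng(0)
print(reliability_report(rng.random((8, 4)) < 0.6))
```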

Circularity Check

0 steps flagged

No circularity: benchmark success is measured by direct state comparison to independent annotations

full rationale

The paper defines task success via explicit comparison of final database state against pre-annotated goal states, with no fitted parameters, self-referential definitions, or predictions that reduce to inputs by construction. The pass^k metric is a simple empirical reliability count over repeated trials. No load-bearing self-citations or uniqueness theorems are invoked. The LM-simulated user design is an explicit modeling choice whose fidelity to humans is an external validity question, not a circularity in the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical parameters, axioms, or postulated entities; its contribution is an empirical benchmark whose validity rests on the domain assumption that LM-simulated users are adequate proxies.

axioms (1)
  • domain assumption: Language models can be prompted to simulate realistic user behavior and goals in domain-specific conversations
    The entire dynamic-interaction component of the benchmark depends on this premise.

pith-pipeline@v0.9.0 · 5466 in / 1296 out tokens · 41765 ms · 2026-05-11T03:14:30.124913+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

    cs.SD 2026-05 accept novelty 8.0

    EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.

  2. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  3. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  4. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL 2026-04 unverdicted novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  5. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...

  6. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  7. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  8. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  9. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  10. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  11. ProactBench: Beyond What The User Asked For

    cs.LG 2026-05 unverdicted novelty 7.0

    ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

  12. FORTIS: Benchmarking Over-Privilege in Agent Skills

    cs.AI 2026-05 unverdicted novelty 7.0

    FORTIS benchmark shows over-privilege is the norm in LLM agent skill selection and execution, with models reaching for higher-privilege skills and tools than required across ten frontier models and three domains.

  13. Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

    cs.CL 2026-05 unverdicted novelty 7.0

    Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.

  14. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...

  15. Switchcraft: AI Model Router for Agentic Tool Calling

    cs.AI 2026-05 unverdicted novelty 7.0

    Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

  16. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  17. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  18. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  19. Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

    cs.GR 2026-04 unverdicted novelty 7.0

    Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

  20. AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

    cs.AI 2026-04 conditional novelty 7.0

    AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...

  21. ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

    cs.CV 2026-04 unverdicted novelty 7.0

    ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

  22. AutomationBench

    cs.AI 2026-04 unverdicted novelty 7.0

    AutomationBench is a new benchmark for AI agents on cross-application REST API workflows where even top models score below 10%.

  23. Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

    cs.AI 2026-04 conditional novelty 7.0

    AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...

  24. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  25. Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

    cs.AI 2026-04 conditional novelty 7.0

    Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.

  26. $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    cs.AI 2025-06 unverdicted novelty 7.0

    τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

  27. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  28. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  29. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  30. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  31. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

  32. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  33. FORTIS: Benchmarking Over-Privilege in Agent Skills

    cs.AI 2026-05 unverdicted novelty 6.0

    FORTIS benchmark shows that frontier LLM agents routinely select and execute skills with higher privileges than required, with over-privilege being the norm across models and domains.

  34. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.

  35. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.

  36. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  37. Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percen...

  38. Agentic Coding Needs Proactivity, Not Just Autonomy

    cs.SE 2026-05 conditional novelty 6.0

    Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.

  39. Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

    cs.AI 2026-05 unverdicted novelty 6.0

    Deployment-relevant AI alignment cannot be inferred from model-level evaluations alone, as benchmark audits show missing interaction support and cross-model tests reveal model-dependent scaffold effects.

  40. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  41. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  42. CL-bench Life: Can Language Models Learn from Real-Life Context?

    cs.CL 2026-04 unverdicted novelty 6.0

    CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

  43. Human-Guided Harm Recovery for Computer Use Agents

    cs.AI 2026-04 conditional novelty 6.0

    Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.

  44. AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

    cs.AI 2026-04 unverdicted novelty 6.0

    AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...

  45. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  46. LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

    cs.CR 2026-04 unverdicted novelty 6.0

    LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and ...

  47. BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

    cs.AI 2026-04 unverdicted novelty 6.0

    BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...

  48. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  49. Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...

  50. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  51. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

  52. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL 2026-05 unverdicted novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  53. Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...

  54. Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    cs.SE 2026-04 unverdicted novelty 5.0

    Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.

  55. Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex

    cs.AI 2026-04 unverdicted novelty 5.0

    ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.

  56. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

  57. AlphaEval: Evaluating Agents in Production

    cs.CL 2026-04 unverdicted novelty 5.0

    AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

  58. From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.

  59. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  60. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 65 Pith papers · 9 internal anchors

  1. [1] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. URL https://arxiv.org/abs/2204.01691

  2. [2] J. Andreas, J. Bufe, D. Burkett, C. Chen, J. Clausman, J. Crawford, K. Crim, J. DeLoach, L. Dorner, J. Eisner, et al. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571, 2020.

  3. [3] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić. MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.

  4. [4] D. Chen, H. Chen, Y. Yang, A. Lin, and Z. Yu. Action-based conversations dataset: A corpus for building more in-depth task-oriented dialogue systems. arXiv preprint arXiv:2104.00783, 2021.

  5. [5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  6. [6] S. Chen, S. Wiseman, and B. Dhingra. ChatShop: Interactive information seeking with language agents. arXiv preprint arXiv:2404.09911, 2024.

  7. [7] S. eun Yoon, Z. He, J. M. Echterhoff, and J. McAuley. Evaluating large language models as generative user simulators for conversational recommendation, 2024.

  8. [8] I. Gür, D. Hakkani-Tür, G. Tür, and P. Shah. User modeling for task oriented dialogues. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 900–906, 2018. doi: 10.1109/SLT.2018.8639652.

  9. [9] H. He, D. Chen, A. Balakrishnan, and P. Liang. Decoupling strategy and generation in negotiation dialogues. arXiv preprint arXiv:1808.09637, 2018.

  10. [10] Z. Hu, Y. Feng, A. T. Luu, B. Hooi, and A. Lipani. Unlocking the potential of user feedback: Leveraging large language model as user simulators to enhance dialogue system. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM '23. ACM, Oct. 2023. doi: 10.1145/3583780.3615220.

  11. [11] Y. Huang, J. Shi, Y. Li, C. Fan, S. Wu, Q. Zhang, Y. Liu, P. Zhou, Y. Wan, N. Z. Gong, et al. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128, 2023.

  12. [12] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

  13. [13] M. Kim, Y. Jung, D. Lee, and S.-w. Hwang. PLM-based world models for text-based games. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1324–1341, 2022.

  14. [14] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.

  15. [15] J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.

  16. [16] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto. Identifying the risks of LM agents with an LM-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023.

  17. [17] J. Schatzmann, D. Jurafsky, M. Galley, and D. Trevillian. Evaluating agenda-based user simulation for reinforcement learning of dialogue management. In Speech Communication, volume 47, pages 95–121, 2007.

  18. [18] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

  19. [19] N. Shinn, B. Labash, and A. Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection, 2023.

  20. [20] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427, 2023.

  21. [21] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

  22. [22] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang. On the tool manipulation capability of open-source large language models, 2023.

  23. [23] F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.

  24. [24] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

  25. [25] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models, 2023.

  26. [26] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models, 2023.

  27. [27] S. Yao, H. Chen, J. Yang, and K. Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022.

  28. [28] E. Zhang, X. Wang, P. Gong, Y. Lin, and J. Mao. USimAgent: Large language models for simulating search users. arXiv preprint arXiv:2403.09142, 2024.

  29. [29] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
