super hub Mixed citations

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Karthik Narasimhan, Noah Shinn, Pedram Razavi, Shunyu Yao · 2024 · cs.AI · arXiv 2406.12045

Mixed citation behavior. Most common role is background (61%).

185 Pith papers citing it

Background 61% of classified citations

open full Pith review browse 185 citing papers more from Karthik Narasimhan arXiv PDF

abstract

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 22 dataset 12 extension 1 method 1

citation-polarity summary

background 22 use dataset 9 unclear 3 extend 1 use method 1

claims ledger

abstract Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate th

authors

Karthik Narasimhan Noah Shinn Pedram Razavi Shunyu Yao

co-cited works

representative citing papers

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

cs.AI · 2026-06-16 · accept · novelty 8.0

LLM user simulators exhibit a disengagement deficit: they match real buyers but systematically overstate purchase intent among real non-buyers by reducing expressed resistance and increasing deliberation.

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

UnderSpecBench shows coding agents guess and violate boundaries in 55.8-67.8% of underspecified DevOps tasks rather than clarifying or refusing.

A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction

cs.AI · 2026-07-02 · conditional · novelty 7.0

A²utoLPBench is a generator that produces unlimited LP word problems with ground-truth answers known by construction via inverse-KKT, bundled with a Docker environment for agent evaluation.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

Entity Binding Failures in Tool-Augmented Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper defines entity binding failures as a distinct error category in tool-augmented agents separate from tool selection errors and evaluates entity-aware mechanisms that eliminate such failures in a controlled diagnostic setting.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents

cs.AI · 2026-06-20 · unverdicted · novelty 7.0

CFAgentBench is a new reproducible benchmark for construction-finance AI agents featuring 35 mock apps, 1,014 tasks, and a money-movement guard, with initial tests showing pass^1 of 0.67 dropping to pass^5 of 0.38.

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

cs.AI · 2026-06-18 · unverdicted · novelty 7.0 · 2 refs

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

SAGE-OPD improves multi-turn OPD via turn-level selective intervention, teacher-confidence weighting, and loss normalization, reporting up to 13.3% relative gain in ALFWorld unseen success rate over standard OPD.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

cs.LG · 2026-06-14 · unverdicted · novelty 7.0

KV caches function as notebooks of prefilled conclusions, enabling field-level edits that recover decisions (especially with CoT) and position-portable skill composition with near-identical outputs at O(L) cost.

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

AgentBeats implements agentified evaluation of diverse AI agents through standardized interfaces, validated at scale in a five-month competition with 298 judges and 467 subjects plus a coding case study.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

cs.CL · 2026-06-09 · conditional · novelty 7.0

ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

citing papers explorer

Showing 16 of 16 citing papers after filters.

Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 40 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment cs.AI · 2025-06-09 · unverdicted · none · ref 26 · internal anchor
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation cs.CL · 2025-12-23 · conditional · none · ref 3 · internal anchor
DIAL uses iterative adversarial learning to restore lexical diversity in user simulators and achieve strong correlation with real failure rates while keeping low divergence in failure mode distributions for mental health dialogues.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 66 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners cs.HC · 2025-09-30 · unverdicted · none · ref 67 · internal anchor
A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.
DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow cs.HC · 2025-09-16 · unverdicted · none · ref 50 · internal anchor
DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module increases user comfort and reliance on AI agents for coordination tasks over time.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 148 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky cs.AI · 2025-07-04 · unverdicted · none · ref 1 · internal anchor
DiaFORGE is a disambiguation-centric fine-tuning pipeline that boosts LLM tool-calling success by 27-49 percentage points on a dynamic benchmark compared to GPT-4o and Claude-3.5-Sonnet.
gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 23 · internal anchor
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 90 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Small Language Models are the Future of Agentic AI cs.AI · 2025-06-02 · unverdicted · none · ref 81 · internal anchor
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 64 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 234 · internal anchor
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models cs.CL · 2025-08-08 · unverdicted · none · ref 48 · internal anchor
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 94 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 89 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer