$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
66 papers indexed by Pith cite this work. Polarity classification is still in progress.
abstract
Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
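Two mechanics in the abstract benefit from a concrete illustration: the outcome check (the database state at the end of a conversation is compared with the annotated goal state) and the pass^k reliability metric (the chance that an agent solves the same task on all k independent trials, not just one). Below is a minimal sketch, assuming the combinatorial estimator C(c, k)/C(n, k) over c successes in n trials and modelling the database comparison as plain dictionary equality; the function names are illustrative and not taken from the $\tau$-bench codebase.

```python
from math import comb

def task_passed(final_db_state: dict, goal_db_state: dict) -> bool:
    # A trial counts as a success when the database at the end of the
    # conversation matches the annotated goal state (modelled here as
    # plain dict equality; a real check may canonicalize the state first).
    return final_db_state == goal_db_state

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    # Unbiased per-task estimate of pass^k: the probability that k trials
    # drawn without replacement from the observed num_trials all succeed,
    # i.e. C(c, k) / C(n, k).
    if num_successes < k:
        return 0.0
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_pass_hat_k(per_task_trials: list[list[bool]], k: int) -> float:
    # Average the per-task estimate over all tasks; each inner list holds
    # the success flags of repeated trials on one task.
    return sum(
        pass_hat_k(len(trials), sum(trials), k) for trials in per_task_trials
    ) / len(per_task_trials)

# Example: a task solved in 3 of 4 trials and a task solved in 1 of 4 trials
# give pass^2 = (3/6 + 0) / 2 = 0.25, even though pass^1 would be 0.5.
print(benchmark_pass_hat_k([[True, True, True, False],
                            [True, False, False, False]], k=2))
```

The gap between pass^1 and pass^k at larger k is what the abstract's "pass^8 < 25% in retail" figure points to: success on a single attempt overstates how consistently the agent behaves across repeated runs of the same task.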
citing papers explorer
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
  EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
  Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
  OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
- TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
  TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
- Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
  BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
- Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
  Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
  GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
- LLM Agents Already Know When to Call Tools -- Even Without Reasoning
  LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- FORTIS: Benchmarking Over-Privilege in Agent Skills
  FORTIS benchmark shows over-privilege is the norm in LLM agent skill selection and execution, with models reaching for higher-privilege skills and tools than required across ten frontier models and three domains.
- Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
  Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
- MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
  MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achieving top accuracy on most benchmarks and 6.69x speedup over baselines.
- Switchcraft: AI Model Router for Agentic Tool Calling
  Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.
- AcademiClaw: When Students Set Challenges for AI Agents
  AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
  NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under integrity evaluation.
- Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
  TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correct answer, dollars per correct answer, and endpoint fidelity.
- Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
  Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
- AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
  AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment predicts external adoption metrics.
- ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
  ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
- AutomationBench
  AutomationBench is a new benchmark for AI agents on cross-application REST API workflows where even top models score below 10%.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
- PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
  PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
- Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
  Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.
- $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
  τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
- When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
  Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
  SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
  A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
  ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
  Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
  TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
  EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
- Tool Calling is Linearly Readable and Steerable in Language Models
  Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
  ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percentage points.
- Agentic Coding Needs Proactivity, Not Just Autonomy
  Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
- Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
  Deployment-relevant AI alignment cannot be inferred from model-level evaluations alone, as benchmark audits show missing interaction support and cross-model tests reveal model-dependent scaffold effects.
- AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
  Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
- Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
  Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
- CL-bench Life: Can Language Models Learn from Real-Life Context?
  CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
- Human-Guided Harm Recovery for Computer Use Agents
  Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.
- AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
  AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
  LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
  BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client-ready outputs.
- Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
  A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
- Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
  Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
  A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
- Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
  Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.