pith. machine review for the scientific record.

arxiv: 2308.03688 · v3 · submitted 2023-08-07 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

AgentBench: Evaluating LLMs as Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 11:34 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords LLM agents · benchmark evaluation · interactive environments · reasoning and decision-making · performance disparity · instruction following · open-source models · commercial LLMs

The pith

A benchmark of eight interactive environments reveals a large performance gap between top commercial LLMs and open-source models as agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AgentBench to measure how effectively large language models can function as agents that reason and make decisions inside interactive settings. Testing across many commercial and open-source models shows the leading commercial systems handle these agent roles capably, yet many open-source alternatives no bigger than 70 billion parameters fall well short. This matters because it pinpoints concrete weaknesses in long-term planning, decision-making, and following instructions that block usable agent applications. The results also indicate that better instruction-following training and high-quality multi-turn alignment data offer paths to improvement, while code-focused training produces inconsistent effects across tasks.

Core claim

We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over numerous API-based and open-sourced LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance.

What carries the argument

AgentBench, the collection of eight interactive environments that directly tests an LLM's ability to sustain reasoning and choose actions over multiple steps.
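The evaluation pattern this formalizes, an agent choosing actions over multiple turns in an interactive environment and being scored on the resulting trajectory, can be sketched roughly as follows. All names and the toy environment here are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch (not the AgentBench implementation) of a multi-turn
# agent-evaluation loop: the model observes, acts, and the environment
# scores the episode. Every name below is a hypothetical illustration.
from dataclasses import dataclass

@dataclass
class ToyEnv:
    """Stand-in interactive environment: reach the target state in few steps."""
    target: int
    state: int = 0
    steps: int = 0
    max_steps: int = 10

    def observe(self) -> str:
        return f"state={self.state} target={self.target}"

    def act(self, action: str) -> bool:
        """Apply one action; return True when the episode is over."""
        self.steps += 1
        if action == "inc":
            self.state += 1
        elif action == "dec":
            self.state -= 1
        return self.state == self.target or self.steps >= self.max_steps

def run_episode(env: ToyEnv, agent_fn) -> float:
    """Run one multi-turn episode; return 1.0 on success, else 0.0."""
    done = False
    while not done:
        done = env.act(agent_fn(env.observe()))
    return 1.0 if env.state == env.target else 0.0

# Trivial policy standing in for an LLM call in a real harness.
def greedy_agent(obs: str) -> str:
    state, target = (int(kv.split("=")[1]) for kv in obs.split())
    return "inc" if state < target else "dec"

score = run_episode(ToyEnv(target=3), greedy_agent)
```

The point of the sketch is the loop shape: success depends on sustaining a coherent policy across turns, which is exactly where the review says weaker models break down.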

If this is right

  • Poor long-term reasoning, decision-making, and instruction following are the primary obstacles to usable LLM agents.
  • Training focused on instruction following and high-quality multi-round alignment data can raise agent performance.
  • Training on code produces mixed effects that vary by the specific agent task.
  • A clear performance separation exists between leading commercial LLMs and open-source models at or below 70B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap suggests that model scale up to 70B alone does not guarantee strong agent behavior without changes in training data or objectives.
  • The benchmark could be used to track whether future alignment methods narrow the commercial-to-open-source difference over successive model releases.
  • Extending the environments to include more physical or long-horizon real-world tasks would test whether the identified failure modes persist outside the current set.

Load-bearing premise

The eight chosen environments and their metrics sufficiently represent the core challenges of real-world agent deployment, and the measured differences reflect actual reasoning gaps rather than prompt or setup artifacts.

What would settle it

Re-testing the same models with substantially altered prompts or additional environments and finding that open-source models under 70B close the gap to commercial ones would show the reported disparity depends on the specific evaluation choices.
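Such a re-test could be summarized with a simple per-template gap analysis. The model names and scores below are placeholders invented for illustration, not numbers from the paper.

```python
# Hypothetical sketch of the re-test described above: score each model
# under several prompt templates and check whether the commercial/OSS
# gap survives prompt variation. All figures are illustrative.
templates = ["strict_format", "relaxed_format", "few_shot"]

# model -> template -> success rate (placeholder numbers, not paper data)
scores = {
    "commercial_a": {"strict_format": 0.70, "relaxed_format": 0.72, "few_shot": 0.74},
    "oss_70b":      {"strict_format": 0.30, "relaxed_format": 0.45, "few_shot": 0.50},
}

def gap_per_template(scores: dict, strong: str, weak: str) -> dict:
    """Gap between two models under each template. A gap that shrinks
    sharply under relaxed prompting would suggest the disparity is
    partly an evaluation artifact rather than a reasoning difference."""
    return {t: scores[strong][t] - scores[weak][t] for t in templates}

gaps = gap_per_template(scores, "commercial_a", "oss_70b")
# Flag if the gap swings by more than 10 points across templates.
artifact_suspected = max(gaps.values()) - min(gaps.values()) > 0.1
```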

read the original abstract

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over numerous API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AgentBench, a multi-dimensional benchmark consisting of 8 distinct interactive environments to quantitatively evaluate LLMs as agents, focusing on their reasoning and decision-making abilities. Extensive testing across numerous API-based commercial and open-source LLMs reveals that top commercial models exhibit strong agentic performance in complex settings, while identifying a significant performance disparity with many OSS competitors no larger than 70B parameters. The authors analyze common failure modes, attributing them primarily to deficiencies in long-term reasoning, decision-making, and instruction following, and propose that improvements in instruction following and training on high-quality multi-round alignment data could help; they also observe ambivalent effects from code training on agent tasks. The benchmark, datasets, environments, and an integrated evaluation package are released publicly.

Significance. If the reported performance disparities prove robust, this benchmark would be a timely and useful contribution to the growing field of LLM agents by providing standardized, multi-environment evaluation that highlights gaps between commercial and open-source models. The public release of the full evaluation framework, datasets, and code is a clear strength that supports reproducibility and community follow-up work. The identification of specific failure modes (e.g., instruction following) offers actionable insights, though the overall significance is tempered by the empirical nature of the study and the need for stronger controls on evaluation artifacts.

major comments (2)
  1. [§4] §4 (Experiments/Evaluation Protocol): The evaluation applies a single, uniform prompting template and interaction protocol across all models without any ablation on prompt variations, model-specific few-shot examples, or relaxed output-format constraints. This is load-bearing for the central disparity claim because commercial models' documented advantage in instruction following (one of the primary failure modes identified) is likely amplified by RLHF, raising the possibility that a non-trivial portion of the gap versus OSS models ≤70B is an artifact of prompt sensitivity rather than intrinsic differences in reasoning or decision-making.
  2. [§3] §3 (Environments) and §4: Insufficient detail is provided on environment construction, exact metric definitions, controls for prompt sensitivity, environment stochasticity, and statistical significance testing (e.g., variance across multiple runs or seeds). Without these, the reliability of the performance numbers and the claim that the eight environments sufficiently isolate core agentic challenges cannot be fully assessed.
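The multi-run variance reporting requested in the second major comment could look roughly like the following sketch; the run numbers are illustrative placeholders, not results from the paper.

```python
# Sketch of seed-variance reporting: run each model several times per
# environment and report the mean with an approximate 95% confidence
# interval (normal approximation). All numbers are illustrative.
from statistics import mean, stdev
from math import sqrt

def mean_ci(runs: list, z: float = 1.96) -> tuple:
    """Mean and half-width of an approximate 95% CI over repeated runs."""
    m = mean(runs)
    half = z * stdev(runs) / sqrt(len(runs)) if len(runs) > 1 else 0.0
    return m, half

# Success rates over 5 seeds for one hypothetical model/environment pair.
runs = [0.62, 0.58, 0.65, 0.60, 0.63]
m, half = mean_ci(runs)

# Would a rival scoring 0.45 fall inside this model's interval? If not,
# the reported gap is unlikely to be run-to-run noise.
indistinguishable = abs(m - 0.45) < half
```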
minor comments (2)
  1. [Abstract] The abstract and §5 (Analysis) could more explicitly quantify the performance gap (e.g., specific success rates or normalized scores for top commercial vs. OSS models) rather than describing it qualitatively as 'significant'.
  2. [Figures] Figure captions and axis labels in the results figures would benefit from greater clarity on what the plotted metrics precisely measure (e.g., success rate, partial credit, or normalized scores).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing our strongest honest defense of the manuscript while committing to revisions that improve clarity and rigor without altering the core claims or experimental design.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments/Evaluation Protocol): The evaluation applies a single, uniform prompting template and interaction protocol across all models without any ablation on prompt variations, model-specific few-shot examples, or relaxed output-format constraints. This is load-bearing for the central disparity claim because commercial models' documented advantage in instruction following (one of the primary failure modes identified) is likely amplified by RLHF, raising the possibility that a non-trivial portion of the gap versus OSS models ≤70B is an artifact of prompt sensitivity rather than intrinsic differences in reasoning or decision-making.

    Authors: The uniform prompting template and interaction protocol were selected to create a standardized, reproducible benchmark that enables fair head-to-head comparison of models' agentic abilities without confounding factors from per-model prompt engineering or few-shot tuning. Allowing model-specific adaptations would undermine the benchmark's purpose of measuring intrinsic capabilities under consistent conditions, as is standard in many LLM evaluation suites. The manuscript already highlights instruction following as a primary failure mode through qualitative analysis of outputs, and the observed performance gaps are consistent with this; commercial models' RLHF advantages in this area are a legitimate component of their superior agent performance rather than an artifact to be removed. We disagree that this renders the disparity claim unreliable, but we will add an explicit discussion of prompt sensitivity limitations and the rationale for standardization in the revised manuscript. revision: partial

  2. Referee: [§3] §3 (Environments) and §4: Insufficient detail is provided on environment construction, exact metric definitions, controls for prompt sensitivity, environment stochasticity, and statistical significance testing (e.g., variance across multiple runs or seeds). Without these, the reliability of the performance numbers and the claim that the eight environments sufficiently isolate core agentic challenges cannot be fully assessed.

    Authors: We will substantially expand Sections 3 and 4 in the revision to provide greater detail on environment construction processes, precise mathematical definitions of all metrics, any existing controls for prompt sensitivity and stochasticity, and results from multiple runs with variance reporting where environments permit. These additions will strengthen the justification that the eight environments collectively isolate core agentic challenges such as long-term reasoning and decision-making. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivations or self-referential claims

full rationale

The paper is an empirical evaluation study that runs LLMs on 8 fixed environments and reports observed success rates, failure modes, and performance gaps. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described structure. Claims rest on direct experimental outcomes rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any derivation because no derivation exists. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes an empirical evaluation framework rather than a theoretical derivation; it relies on standard assumptions about LLM prompting but introduces no new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption LLMs can be effectively prompted to function as interactive agents in multi-step environments
    The benchmark construction and evaluation rest on this premise for all tested models and environments.

pith-pipeline@v0.9.0 · 5570 in / 1305 out tokens · 73551 ms · 2026-05-11T11:34:51.717537+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  4. PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

    cs.AI 2026-05 conditional novelty 8.0

    PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

  5. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  6. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  7. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    cs.CL 2023-08 unverdicted novelty 8.0

    LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

  8. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  9. RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.

  10. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  11. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  12. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  13. Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

    cs.MA 2026-05 unverdicted novelty 7.0

    EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...

  14. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  15. CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

    cs.CR 2026-05 unverdicted novelty 7.0

    LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

  16. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  17. Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.

  18. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  19. Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

    cs.GR 2026-04 unverdicted novelty 7.0

    Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

  20. ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

    cs.CV 2026-04 unverdicted novelty 7.0

    ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

  21. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  22. Trust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents

    cs.MA 2026-04 unverdicted novelty 7.0

    In 188 multi-round Avalon games, LLM agents with cross-game memory form reputations that boost high-reputation players' team inclusions by 46% and show more strategic deception (75% vs 36%) with higher reasoning effort.

  23. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  24. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  25. WhatIf: Interactive Exploration of LLM-Powered Social Simulations for Policy Reasoning

    cs.HC 2026-04 unverdicted novelty 7.0

    WhatIf provides an interactive platform for real-time exploration of LLM-driven social simulations, enabling policymakers to iteratively test plans, reflect on assumptions, and uncover vulnerabilities in emergency pre...

  26. SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  27. Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

    cs.AI 2026-04 conditional novelty 7.0

    AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...

  28. ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

    cs.CV 2026-04 unverdicted novelty 7.0

    ARGOS is the first benchmark reformulating multi-camera person search as an agentic interactive reasoning task grounded in a spatio-temporal topology graph, with 2691 tasks across three tracks where current LLMs achie...

  29. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  30. FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

  31. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  32. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  33. M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

    cs.CY 2026-03 conditional novelty 7.0

    M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.

  34. $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    cs.AI 2025-06 unverdicted novelty 7.0

    τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

  35. $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    cs.AI 2024-06 unverdicted novelty 7.0

    τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.

  36. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  37. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  38. Exploiting LLM Agent Supply Chains via Payload-less Skills

    cs.CR 2026-05 conditional novelty 6.0

    Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

  39. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  40. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  41. On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

  42. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  43. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  44. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  45. AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

    cs.CL 2026-05 unverdicted novelty 6.0

    AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.

  46. EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

    cs.SE 2026-05 unverdicted novelty 6.0

    EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.

  47. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  48. Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percen...

  49. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  50. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 unverdicted novelty 6.0

    NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

  51. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  52. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  53. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  54. LATTICE: Evaluating Decision Support Utility of Crypto Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

  55. ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

    cs.LG 2026-04 unverdicted novelty 6.0

    ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.

  56. CHORUS: An Agentic Framework for Generating Realistic Deliberation Data

    cs.AI 2026-04 unverdicted novelty 6.0

    Chorus generates realistic deliberation discussions via LLM agents with memory and Poisson-timed participation, validated by 30 experts on realism, coherence, and utility.

  57. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  58. How Adversarial Environments Mislead Agentic AI?

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

  59. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  60. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
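Several entries above describe routing between model tiers; the tiered-routing idea in item 52 (send routine tool-use tasks to a small open-weight model, escalate long-horizon planning to a frontier model) can be sketched in a few lines. This is an illustrative sketch only: the model-tier names, the `Task` type, and the step-count heuristic are assumptions for the example, not details from the listed paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    expected_steps: int  # rough horizon estimate from a planner or heuristic

def make_router(step_threshold: int = 5) -> Callable[[Task], str]:
    """Return a router that picks a model tier from the estimated horizon."""
    def route(task: Task) -> str:
        # Short-horizon tool calls go to the cheap tier; long-horizon
        # planning escalates to the expensive tier.
        if task.expected_steps <= step_threshold:
            return "small-open-weight"
        return "frontier"
    return route

route = make_router()
assert route(Task("fetch today's weather", expected_steps=2)) == "small-open-weight"
assert route(Task("plan a multi-week research project", expected_steps=20)) == "frontier"
```

In practice the horizon estimate would come from a classifier or the task spec rather than a hand-set field, but the cost saving comes from the same single branch.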

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 83 Pith papers · 3 internal anchors
