4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.
citing papers explorer
- Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
  ToolBench-X is a new benchmark for tool-using LLM agents featuring five structured recoverable hazards across domains and workflows, with experiments showing a substantial reliability gap driven by poor hazard diagnosis rather than tool volume or compute.
- VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
  VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
- Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
  NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
- The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
  The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.