AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.
Escapebench: Pushing language models to think outside the box
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.