ToolEmu uses LM-based tool emulation to test LM agents on 36 high-stakes toolkits and 144 test cases, revealing that even the safest agent exhibits failures 23.9% of the time.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2023 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Identifying the Risks of LM Agents with an LM-Emulated Sandbox