ToolEmu uses LM-based tool emulation to test LM agents on 36 high-stakes tools and 144 cases, revealing that even the safest agent fails 23.9% of the time.
One tab was a Google Doc in edit mode, while the other was a payment gateway
1 Pith paper cites this work. Polarity classification is still being indexed.
Citation-role summary: method 1
Fields: cs.AI 1
Years: 2023 1
Verdicts: UNVERDICTED 1
Roles: method 1
Polarities: use method 1
Representative citing papers:
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox