HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

2026 · cs.LG · arXiv 2604.13954

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
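
The class counts in the abstract (523 risky, 106 safe) imply a skewed label distribution, which matters when reading trajectory-level detection scores. The sketch below is only an illustration of that arithmetic: the counts come from the abstract, while the `f1` helper and the "always predict risky" baseline are assumptions introduced here, not part of the paper's evaluation protocol. It uses a standard binary F1 with "risky" as the positive class, which is not the Strict-F1 metric the abstract reports for localization.

```python
# Counts taken from the abstract; everything else is an illustrative assumption.
RISKY = 523
SAFE = 106
TOTAL = RISKY + SAFE  # 629 trajectories


def f1(tp: int, fp: int, fn: int) -> float:
    """Standard binary F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Share of trajectories labeled risky.
risky_share = RISKY / TOTAL

# Hypothetical trivial detector that flags every trajectory as risky:
# all 523 risky trajectories are true positives, all 106 safe ones are
# false positives, and there are no false negatives.
always_risky_f1 = f1(tp=RISKY, fp=SAFE, fn=0)

print(f"risky share: {risky_share:.3f}")
print(f"always-risky F1: {always_risky_f1:.3f}")
```

Under these assumptions, roughly 83% of trajectories are risky and the trivial detector scores an F1 of about 0.91 on the risky class, so a detection score is best compared against that floor rather than against zero.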

representative citing papers

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

OpenClawBench annotates 31,264 agent trajectories to show that roughly 9% of task-successful executions contain measurable process anomalies, and a fine-tuned detector reaches F1 0.729 on held-out data.

citing papers explorer

Showing 1 of 1 citing paper.

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

cs.AI · 2026-05-28 · unverdicted · none · ref 18 · internal anchor
