A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
Trace: Trajectory-aware comprehensive evaluation for deep research agents
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
citing papers explorer
-
Holistic Evaluation and Failure Diagnosis of AI Agents
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
-
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.