General Agent Evaluation
Original abstract
General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. We present the first such systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures × 5 backbone LLMs (three closed-source, two open-weight) × 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per-domain customization; (ii) agent architecture choice swings results by up to 12pp within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily-customized domain-specific agents; (iv) the open-weight models tested exhibit "generality sinks" absent from frontier closed-source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture-distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.
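The abstract does not spell out the harness API, but the factorial design it describes is straightforward to picture. Below is a minimal Python sketch of the 5 × 5 × 6 sweep; every identifier (ARCHITECTURES, MODELS, BENCHMARKS, full_factorial, the run callback) is a hypothetical placeholder rather than the released harness, and the fifth architecture is left unnamed because the abstract lists only four.

```python
# Minimal sketch of the full factorial sweep described in the abstract:
# every agent architecture paired with every backbone model on every
# benchmark through one shared harness interface. All names are
# illustrative placeholders, not the paper's released API.
from dataclasses import dataclass
from itertools import product
from typing import Callable, List

# The abstract names four architectures; the fifth entry is a placeholder.
ARCHITECTURES = ["tool-calling", "mcp", "code-generation", "cli", "arch-5"]
MODELS = ["closed-1", "closed-2", "closed-3", "open-1", "open-2"]  # 3 closed-source, 2 open-weight
BENCHMARKS = ["bench-1", "bench-2", "bench-3", "bench-4", "bench-5", "bench-6"]


@dataclass
class Result:
    architecture: str
    model: str
    benchmark: str
    score: float


def full_factorial(run: Callable[[str, str, str], float]) -> List[Result]:
    """Evaluate every (architecture, model, benchmark) cell: 5 x 5 x 6 = 150 runs.

    `run` stands in for the harness call that surfaces the benchmark to the
    agent via the unifying protocol and returns its aggregate score.
    """
    return [
        Result(arch, model, bench, run(arch, model, bench))
        for arch, model, bench in product(ARCHITECTURES, MODELS, BENCHMARKS)
    ]


if __name__ == "__main__":
    # Dummy scorer so the sketch executes end to end.
    table = full_factorial(lambda arch, model, bench: 0.0)
    print(f"{len(table)} configurations evaluated")
```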
Forward citations
Cited by 4 Pith papers
- Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents
  HCL-GP learns parameterized policies and reuses extracted components to achieve 98% accuracy on AppWorld benchmark tasks for LLM agents, outperforming static synthesis by 15.8 points on challenges.
- What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
  Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
- What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
  In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
- Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
  Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.