{"work":{"id":"57acc3ec-f4c3-49ab-bd0f-5aab91002df9","openalex_id":null,"doi":null,"arxiv_id":"2604.06132","raw_key":null,"title":"Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents","authors":null,"authors_text":null,"year":2026,"venue":"cs.AI","abstract":"Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing these gaps with 300 human-verified tasks spanning 9 categories across three groups: general service orchestration, multimodal perception and interaction, and multi-turn professional dialogue. To enable trajectory-aware grading, each run is recorded through three independent evidence channels: execution traces, audit logs, and environment snapshots, yielding 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, with Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models show that: (1) Trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures detected by our framework. (2) Capability does not imply consistency: Pass@3 remains stable under error injection while Pass^3 drops by up to 24 percentage points. (3) Agent capability is strongly multi-dimensional, with model rankings varying across task groups and metrics, indicating that our heterogeneous evaluation coverage is essential. Claw-Eval highlights directions for developing agents that are not only capable but reliably deployable.","external_url":"https://arxiv.org/abs/2604.06132","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-13T07:32:30.465529+00:00","pith_arxiv_id":"2604.06132","created_at":"2026-05-09T05:55:29.667206+00:00","updated_at":"2026-05-14T17:03:06.181135+00:00","title_quality_ok":true,"display_title":"Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents","render_title":"Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents"},"hub":{"state":{"work_id":"57acc3ec-f4c3-49ab-bd0f-5aab91002df9","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":11,"external_cited_by_count":null,"distinct_field_count":4,"first_pith_cited_at":"2026-04-13T00:27:32+00:00","last_pith_cited_at":"2026-05-12T17:59:58+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T21:56:17.835469+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"use_dataset","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}