pith · machine review for the scientific record

The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution

7 Pith papers cite this work.

7 Pith papers citing it

by year: 2026 (6) · 2025 (1)

representative citing papers

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Claw-Eval is a trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate task completion, safety, and robustness across 300 tasks, finding that opaque outcome-only grading misses 44% of safety issues.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks, attributed to new asynchronous RL methods and cost-saving DSA.

citing papers explorer


  • GLM-5: from Vibe Coding to Agentic Engineering cs.LG · 2026-02-17 · unverdicted · none · ref 22
