
The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025

16 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 2 · dataset: 2


years

2026: 15 · 2025: 1


representative citing papers

Qwen-AgentWorld: Language World Models for General Agents

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Qwen-AgentWorld are language world models that simulate multi-domain agent environments and boost general agent capabilities via decoupled RL simulation and unified foundation model training.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

cs.AI · 2026-06-08 · unverdicted · novelty 5.0

Expert-curated rubrics in the new ComplexConstraints dataset improve LLM instruction following by 12-15% when used as RL training signals, with gains transferring to out-of-distribution agentic benchmarks.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
