TerminalWorld builds a scalable benchmark of 1,530 real terminal tasks from recordings and finds frontier models and agents reach at most 62.5% pass rate with only weak correlation to prior expert-curated sets.
Large-scale terminal agentic trajectory generation from dockerized environments. CoRR, abs/2602.01244
6 Pith papers cite this work.
years
2026: 6
representative citing papers
CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.
Tmax is an open RL training recipe for terminal agents that achieves 27% on Terminal-Bench 2.0 with a 9B model via a novel data generation taxonomy combining difficulty control, personas, and verifier diversification.
Trajectories from weaker agents outperform stronger ones for training terminal agents due to environment-grounded supervision that exposes inspect-act-verify behaviors.
LiteCoder-Terminal-Gen creates synthetic terminal datasets that, after SFT and DMPO on Qwen models, yield 29.06%, 18.54%, and 34.00% pass@1 on Terminal-Bench 1.0, 2.0, and Pro, respectively.
HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.
citing papers explorer
- CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.
- What Makes Interaction Trajectories Effective for Training Terminal Agents?
Trajectories from weaker agents outperform stronger ones for training terminal agents due to environment-grounded supervision that exposes inspect-act-verify behaviors.