Large-scale terminal agentic trajectory generation from dockerized environments. CoRR, abs/2602.01244
3 Pith papers cite this work.
Years: 2026 (3)
Citing papers explorer
- TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
  TerminalWorld builds a scalable benchmark of 1,530 real terminal tasks from recordings and finds frontier models and agents reach at most 62.5% pass rate with only weak correlation to prior expert-curated sets.
- LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
  LiteCoder-Terminal-Gen creates synthetic terminal datasets that, after SFT and DMPO on Qwen models, yield 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro.
- HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
  HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.