TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
Pith reviewed 2026-06-30 01:16 UTC · model grok-4.3
The pith
TUA-Bench evaluates terminal agents on 120 general-purpose tasks and finds that the strongest agent reaches 65.8 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TUA-Bench consists of 120 real-world tasks across five families for terminal-use agents. The tasks address routine activities such as document editing and email management together with scientific and engineering workflows co-designed with domain experts. Each task runs in a real terminal under a deterministic setup script and receives an execution-based score. The strongest frontier agent, Claude Code powered by Claude Opus 4.8 at maximum reasoning effort, reaches 65.8 percent overall performance with substantial gaps across both routine and expert tracks.
What carries the argument
The TUA-Bench benchmark itself: 120 manually designed tasks, each with a deterministic terminal setup and an execution-based scoring protocol.
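A minimal sketch of what "deterministic setup plus execution-based scoring" can look like in practice. This is not the TUA-Bench harness; the task, the file name `notes.txt`, and the functions `setup_task`, `hypothetical_agent`, `score_task`, and `run_episode` are all invented for illustration. The point is only that the score comes from inspecting the final state of the environment, not from reading the agent's transcript.

```python
import tempfile
from pathlib import Path


def setup_task(workdir: Path) -> None:
    """Deterministic setup: every run starts from the same file contents."""
    (workdir / "notes.txt").write_text("TODO: write report\nTODO: send email\n")


def hypothetical_agent(workdir: Path) -> None:
    """Stand-in for an agent (assumption): marks only the first item as done."""
    path = workdir / "notes.txt"
    path.write_text(path.read_text().replace("TODO", "DONE", 1))


def score_task(workdir: Path) -> float:
    """Execution-based check: look at the resulting files, return a score in [0, 1]."""
    path = workdir / "notes.txt"
    if not path.exists():
        return 0.0
    lines = path.read_text().splitlines()
    if len(lines) < 2:
        return 0.0
    checks = [
        lines[0] == "DONE: write report",  # the requested edit was made
        lines[1] == "TODO: send email",    # unrelated content was left untouched
    ]
    return sum(checks) / len(checks)


def run_episode(agent) -> float:
    """Run setup, then the agent, then the checker, in a fresh temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        setup_task(workdir)
        agent(workdir)
        return score_task(workdir)


print(run_episode(hypothetical_agent))
```

Because the setup is deterministic and the checker only reads the final state, two runs of the same agent behavior receive the same score, which is the reproducibility property the review highlights.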
If this is right
- Terminal agents must close large performance gaps before they can be considered reliable for general use.
- Routine tasks such as document editing and email management expose limitations not addressed by coding-only benchmarks.
- Scientific workflows that require specialized software demand agent capabilities beyond standard shell commands.
- Execution-based scoring offers an objective alternative to human judgment for terminal agent evaluation.
- Future agent development can target measurable improvement on both routine and expert task families.
Where Pith is reading between the lines
- The benchmark could be extended with tasks that involve live state changes or multi-turn user interactions to test robustness further.
- The observed gaps suggest current agents would benefit from training data that emphasizes non-programming terminal workflows.
- Pairing TUA-Bench results with existing GUI benchmarks would allow direct comparison of terminal versus graphical computer-use performance.
- A score of 65.8 percent implies that unsupervised deployment of these agents in varied digital environments remains premature.
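To put the headline number in context, the following sketch estimates the sampling uncertainty of a score measured on only 120 tasks. It assumes, purely for illustration, that scoring is binary and that 65.8 percent corresponds to 79 of 120 tasks solved; the source does not state this, and partial-credit scoring would change the calculation. The function `wilson_interval` is a standard Wilson score interval written out by hand.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 65.8% of 120 tasks is read as 79 fully solved tasks.
low, high = wilson_interval(79, 120)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 57 to 74 percent, so small differences between agents on this benchmark should be read with caution.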
Load-bearing premise
The 120 manually designed tasks and the execution-based scoring protocol together provide a representative and unbiased measure of general-purpose terminal-use agent capabilities.
What would settle it
A new agent that scores above 90 percent on the 120 tasks yet fails on similar but unseen terminal tasks outside the benchmark set would indicate the tasks do not generalize.
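The falsification test above can be sketched as a simple decision rule. The 0.90 bar comes from the sentence above; the `max_gap` threshold, the function names, and the example scores are hypothetical choices for illustration, not values from the paper.

```python
from statistics import mean


def generalization_gap(benchmark_scores: list[float], heldout_scores: list[float]) -> float:
    """Mean benchmark score minus mean score on unseen, similar tasks."""
    return mean(benchmark_scores) - mean(heldout_scores)


def suggests_poor_generalization(
    benchmark_scores: list[float],
    heldout_scores: list[float],
    high_bar: float = 0.90,  # "scores above 90 percent" on the benchmark
    max_gap: float = 0.15,   # hypothetical tolerance for the drop on unseen tasks
) -> bool:
    """True when an agent clears the bar on the benchmark yet drops sharply elsewhere."""
    clears_bar = mean(benchmark_scores) > high_bar
    large_drop = generalization_gap(benchmark_scores, heldout_scores) > max_gap
    return clears_bar and large_drop


# Made-up per-task scores: 95% on the benchmark, 50% on held-out tasks.
benchmark = [1.0] * 95 + [0.0] * 5
heldout = [1.0] * 5 + [0.0] * 5
print(suggests_poor_generalization(benchmark, heldout))
```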
Original abstract
As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TUA-Bench, a benchmark for general-purpose terminal-use agents (TUAs) consisting of 120 manually designed tasks across five families. Tasks span routine activities (document editing, email, web search) and expert co-designed scientific/engineering workflows, each with deterministic setup scripts and execution-based scoring. The central empirical result is that the strongest frontier agent (Claude Code with Claude Opus 4.8 at max reasoning effort) reaches 65.8% overall success, with substantial gaps across tracks; the benchmark is positioned as more general than prior GUI or shell-focused suites.
Significance. If the task set is representative, TUA-Bench fills a documented gap between GUI-centric computer-use benchmarks and narrow coding/shell benchmarks, supplying a reproducible, execution-scored evaluation that could guide development of broader terminal agents. The deterministic setups and execution-based protocol are explicit strengths that support reproducibility and reduce scoring ambiguity.
major comments (2)
- [Abstract] The claim that the 65.8% result and the observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities rests on the representativeness of the 120 tasks. The abstract supplies no information on the task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus. This leaves open the possibility that performance ceilings reflect curation choices rather than intrinsic limitations.
- [Section 3] Task construction (benchmark design): although the five task families and the PhD co-design are described, the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim. This directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.
minor comments (1)
- [Abstract] The model identifier 'Claude Opus 4.8' is not standard; a brief clarification of the exact model/version and reasoning-effort parameterization would improve precision.
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of justifying the representativeness of TUA-Bench tasks. We address each major comment below and will revise the manuscript to strengthen the description of task construction while being transparent about limitations.
Point-by-point responses
- Referee: [Abstract] The claim that the 65.8% result and the observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities rests on the representativeness of the 120 tasks. The abstract supplies no information on the task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus. This leaves open the possibility that performance ceilings reflect curation choices rather than intrinsic limitations.
  Authors: We agree that the abstract would benefit from more context on task selection. In the revision we will expand the abstract to state that the 120 tasks were manually designed to cover five families spanning routine digital activities and PhD-co-designed scientific workflows, with deterministic setups and execution-based scoring. Inter-rater reliability metrics were not computed because task design was an iterative collaborative process rather than independent rating. We will also note that external validation against usage distributions was not performed. These additions will make the scope and limitations of the 'broad and realistic' phrasing clearer without overstating the evidence.
  Revision: partial
- Referee: [Section 3] Task construction (benchmark design): although the five task families and the PhD co-design are described, the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim. This directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.
  Authors: We will revise Section 3 to include quantitative diversity statistics, such as the breakdown of tasks by family, command categories, and estimated complexity. We will also add a limitations paragraph explicitly discussing the absence of direct comparison to public terminal usage logs or corpora. The task set was constructed through author expertise supplemented by PhD-level domain experts to target both everyday and specialized terminal activities; however, no suitable public corpora existed for quantitative overlap analysis. This revision will allow readers to evaluate the general-purpose claim more precisely while preserving the benchmark's contribution.
  Revision: partial
- Direct quantitative overlap or comparison against real-world terminal usage distributions or existing corpora, as no appropriate public datasets were identified for such validation.
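The diversity statistics promised in the rebuttal could start from something as simple as the share of tasks per family plus a single balance score. The sketch below is one possible way to compute these, not the authors' method; the family labels `A` to `E` and their counts are placeholders, since the source does not give the actual distribution, and `family_breakdown` and `normalized_entropy` are hypothetical helper names.

```python
import math
from collections import Counter


def family_breakdown(task_families: list[str]) -> dict[str, float]:
    """Share of tasks belonging to each family."""
    counts = Counter(task_families)
    total = len(task_families)
    return {family: count / total for family, count in counts.items()}


def normalized_entropy(shares: dict[str, float]) -> float:
    """Shannon entropy divided by its maximum; 1.0 means a perfectly even split."""
    if len(shares) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    return entropy / math.log(len(shares))


# Placeholder labels and counts, NOT the paper's actual task distribution.
tasks = ["A"] * 30 + ["B"] * 30 + ["C"] * 20 + ["D"] * 20 + ["E"] * 20
shares = family_breakdown(tasks)
balance = normalized_entropy(shares)
print(shares, round(balance, 3))
```

A balance score near 1.0 would indicate that no single family dominates the benchmark; it says nothing about whether the families themselves match real-world terminal usage, which is the referee's deeper concern.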
Circularity Check
No circularity: empirical benchmark with direct measurements
The paper introduces TUA-Bench as a new set of 120 manually designed tasks with deterministic setup scripts and execution-based scoring. No equations, fitted parameters, or predictions are derived from prior quantities; reported performance (e.g., 65.8% for Claude Code) consists of direct empirical measurements on the introduced tasks. No self-citations serve as load-bearing premises for any result, and the construction does not reduce any claimed outcome to its own inputs by definition. The work is self-contained as a benchmark proposal.