pith. sign in

arxiv: 2606.03203 · v1 · pith:GVCPU4AOnew · submitted 2026-06-02 · 💻 cs.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Pith reviewed 2026-06-28 10:11 UTC · model grok-4.3

classification 💻 cs.AI
keywords clinical computer-use agentsmedical GUI benchmarkAI agents for healthcarescreenshot-based evaluationOpenEMRsafety dimensionsclinical scenarios
0
0 comments X

The pith

A benchmark of 18 clinical scenarios shows top AI agents complete only 54 percent of medical interface tasks and under 9 percent on real OpenEMR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedCUA-Bench to measure how well computer-use agents handle medical graphical interfaces that demand domain knowledge and safety checks. It reconstructs 18 scenarios across 10 domains from product manuals and open-source systems, each with intent-level and step-level goals plus a checker for completion and five safety dimensions. Tests on 23 agents find closed-source models top out at 54.2 percent strict success while open-source models average 2.5 percent, with every model below 9 percent on the actual OpenEMR system. The results establish that current agents fall short of reliable clinical use.

Core claim

MedCUA-Bench demonstrates that existing agents cannot reliably operate clinical software because medical interfaces differ in design, require specialized knowledge, and impose safety constraints that general web or desktop agents do not face, as shown by the low success rates across all tested models on both reconstructed and real systems.

What carries the argument

MedCUA-Bench, a screenshot-only interactive benchmark that supplies paired intent- and step-level goals and evaluates both task completion and five clinical safety dimensions on 18 scenarios reconstructed from real medical systems.

If this is right

  • Agents must combine clinical reasoning with precise UI execution to reach usable reliability.
  • Safety checks beyond task completion are required for any deployment in medical settings.
  • Reproducible testbeds like this one enable targeted progress on domain-specific interfaces.
  • Open-source agents lag closed-source ones by a wide margin on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General-purpose agents trained on everyday web tasks will require substantial adaptation before they can handle specialized professional software.
  • The gap between reconstructed and real-system performance points to missing real-world variability that future benchmarks should add.
  • Low scores on intent-level goals suggest the main barrier is clinical understanding rather than pure screen navigation.

Load-bearing premise

The 18 scenarios built from product manuals and open-source medical systems accurately reflect the interfaces, knowledge needs, and safety rules of actual clinical software without missing constraints or adding artifacts.

What would settle it

Run the same 23 agents on a fresh installation of a medical system not used in the benchmark reconstruction and compare success rates and safety scores to those reported for OpenEMR.

Figures

Figures reproduced from arXiv: 2606.03203 by Dongsheng Li, Jia Yu, Shuo Wang, Xinyang Jiang, Zilong Wang.

Figure 1
Figure 1. Figure 1: Why medical GUI agents require a dedicated benchmark. General-purpose GUI benchmarks lack realistic [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of MedCUA. Clinicians construct environments spanning ten domains and provide two goal [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall strict success rate by model. Closed [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Success rate by page-fidelity tier. OpenEMR flattens every model to single digits, while the OHIF imaging [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-model success rate under the two goal granularities. Each pair of bars shares the same underlying 216 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Episode outcome decomposition. Most fail [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Recorded non-critical safety violations by [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MedCUA-Bench, a screenshot-only interactive benchmark for clinical computer-use agents consisting of 18 scenarios across 10 medical domains. Scenarios are reconstructed from product manuals and open-source systems to avoid licensing/privacy issues. Each task includes paired intent- and step-level goals and is scored by a deterministic checker on task completion plus five clinical safety dimensions. Evaluation of 23 agents shows the best closed-source model at 54.2% strict success, all models below 9% on real OpenEMR, and open-source agents averaging 2.5% (best 16.2%). The work positions the benchmark as exposing a reliability gap for future research.

Significance. If the reconstruction and checker are shown to be faithful, the benchmark would be a valuable contribution by supplying the first dedicated, reproducible testbed for clinical GUI agents in a domain where safety constraints and domain knowledge differ sharply from general web/desktop tasks. The explicit separation of intent vs. execution goals and the multi-dimensional safety scoring are positive design choices that could support targeted progress.

major comments (3)
  1. [§3] §3 (Benchmark Construction): No validation protocol is described for confirming that the 18 reconstructed scenarios preserve authentic UI state transitions, domain-specific validation rules, and safety edge cases from the source manuals and systems. Without side-by-side expert review or failure-mode coverage metrics, the headline claim that the benchmark exposes a genuine clinical gap cannot be assessed.
  2. [§4] §4 (Evaluation Protocol): The implementation of the deterministic checker and the precise operational definitions of the five clinical safety dimensions are not supplied. This directly affects interpretability of the strict-success metric and the reported 54.2% / <9% figures.
  3. [§5] §5 (Agent Evaluation): Selection criteria, prompting templates, and execution harness for the 23 agents are not detailed, so the comparability of closed-source vs. open-source results and the reproducibility of the 2.5% / 16.2% open-source numbers cannot be verified.
minor comments (2)
  1. [Table 2] Table 2: Column headers for safety dimensions should explicitly reference the definitions introduced in §4.1 to avoid ambiguity.
  2. [Figure 3] Figure 3: The OpenEMR comparison bar chart would be clearer with error bars or per-agent breakdowns rather than aggregate averages only.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the potential value of MedCUA-Bench as a dedicated testbed. We address each major comment point-by-point below, with planned revisions to improve transparency and reproducibility.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): No validation protocol is described for confirming that the 18 reconstructed scenarios preserve authentic UI state transitions, domain-specific validation rules, and safety edge cases from the source manuals and systems. Without side-by-side expert review or failure-mode coverage metrics, the headline claim that the benchmark exposes a genuine clinical gap cannot be assessed.

    Authors: We agree that the current manuscript lacks an explicit validation protocol description. In the revised version we will add a dedicated subsection to §3 that documents the reconstruction and verification process, including how UI state transitions, domain rules, and safety edge cases were checked against the source manuals and open-source systems. This will include the internal review steps performed during construction. revision: yes

  2. Referee: [§4] §4 (Evaluation Protocol): The implementation of the deterministic checker and the precise operational definitions of the five clinical safety dimensions are not supplied. This directly affects interpretability of the strict-success metric and the reported 54.2% / <9% figures.

    Authors: The operational definitions and checker logic are essential for interpretability. We will expand §4 in the revision to provide the precise definitions of the five clinical safety dimensions and a detailed description (including pseudocode) of the deterministic checker. We will also release the checker implementation as supplementary open-source code to enable full reproducibility of the reported metrics. revision: yes

  3. Referee: [§5] §5 (Agent Evaluation): Selection criteria, prompting templates, and execution harness for the 23 agents are not detailed, so the comparability of closed-source vs. open-source results and the reproducibility of the 2.5% / 16.2% open-source numbers cannot be verified.

    Authors: We acknowledge that greater detail is required for reproducibility. The revised §5 will specify the selection criteria for the 23 agents, include the full prompting templates, and describe the execution harness in sufficient detail to allow independent replication of the closed-source and open-source comparisons. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or fitted predictions exhibits no circularity.

full rationale

The manuscript introduces MedCUA-Bench as an empirical testbed for clinical agents, reporting success rates (e.g., 54.2% closed-source strict success) from direct evaluation on 18 reconstructed scenarios. No equations, parameter fits, predictions derived from subsets of data, or load-bearing self-citations appear in the abstract or described structure. All performance claims rest on explicit experimental runs against the benchmark and real OpenEMR, without any reduction of results to inputs by construction. This is the expected outcome for a pure benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that reconstructed scenarios from manuals faithfully represent real clinical software challenges and safety needs.

axioms (1)
  • domain assumption Reconstructed interfaces from product manuals and open-source systems capture authentic clinical UIs, domain knowledge, and safety requirements
    Invoked in abstract as the basis for benchmark validity and to avoid licensing/privacy constraints.

pith-pipeline@v0.9.1-grok · 5747 in / 1214 out tokens · 36053 ms · 2026-06-28T10:11:40.623448+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    2024 , note =

    Hong, Wenyi and Wang, Weihan and Lv, Qingsong and Xu, Jiazheng and Yu, Wenmeng and Ji, Junhui and Wang, Yan and Wang, Zihan and Zhang, Yuxuan and Lai, Hanyu and others , booktitle =. 2024 , note =

  2. [2]

    2024 , note =

    Zheng, Boyuan and Gou, Boyu and Kil, Jihyung and Sun, Huan and Su, Yu , booktitle =. 2024 , note =

  3. [3]

    2024 , doi =

    Niu, Runliang and Li, Jindong and Wang, Shiqi and Fu, Yali and Hu, Xiyu and Leng, Xueyuan and Kong, He and Chang, Yi and Wang, Qi , booktitle =. 2024 , doi =

  4. [4]

    Proceedings of the 34th International Conference on Machine Learning (ICML) , series =

    World of Bits: An Open-Domain Platform for Web-Based Agents , author =. Proceedings of the 34th International Conference on Machine Learning (ICML) , series =

  5. [5]

    International Conference on Learning Representations (ICLR) , year =

    Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration , author =. International Conference on Learning Representations (ICLR) , year =

  6. [6]

    2022 , note =

    Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle =. 2022 , note =

  7. [7]

    2023 , note =

    Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Samuel and Wang, Boshi and Sun, Huan and Su, Yu , booktitle =. 2023 , note =

  8. [8]

    Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle =. 2024 , note =

  9. [9]

    2024 , note =

    Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yu and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel , booktitle =. 2024 , note =

  10. [10]

    2024 , note =

    Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and others , booktitle =. 2024 , note =

  11. [11]

    Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Lu, Yadong and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and others , journal =

  12. [12]

    and Del Verme, Manuel and Marty, Tom and Boisvert, L

    Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Del Verme, Manuel and Marty, Tom and Boisvert, L. Proceedings of the 41st International Conference on Machine Learning (ICML) , year =

  13. [13]

    2025 , eprint=

    The BrowserGym Ecosystem for Web Agent Research , author=. 2025 , eprint=

  14. [14]

    The Fourteenth International Conference on Learning Representations , year=

    ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows , author=. The Fourteenth International Conference on Learning Representations , year=

  15. [15]

    2024 , note =

    Ma, Zeyao and Zhang, Bohan and Zhang, Jing and Yu, Jifan and Zhang, Xiaokang and Zhang, Xiaohan and Luo, Sijia and Wang, Xi and Tang, Jie , booktitle =. 2024 , note =

  16. [16]

    Schmidgall, Samuel and Ziaei, Rojin and Harris, Carl and Reis, Eduardo and Jopling, Jeffrey and Moor, Michael , journal =

  17. [17]

    and Geng, Gloria and Park, Danny and Zou, James and Ng, Andrew Y

    Jiang, Yixing and Black, Kameron C. and Geng, Gloria and Park, Danny and Zou, James and Ng, Andrew Y. and Chen, Jonathan H. , journal =

  18. [18]

    and Anwar, Zain and Sarfo-Gyamfi, Maame and Safranek, Conrad W

    Khandekar, Nikhil and Jin, Qiao and Xiong, Guangzhi and Dunn, Soren and Applebaum, Serina S. and Anwar, Zain and Sarfo-Gyamfi, Maame and Safranek, Conrad W. and Anwar, Abid Ayaz and Zhang, Andrew and others , booktitle =. 2024 , note =

  19. [19]

    2026 , eprint=

    MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI , author=. 2026 , eprint=

  20. [20]

    2024 , howpublished =

  21. [21]

    Annals of Internal Medicine , volume =

    Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties , author =. Annals of Internal Medicine , volume =. 2016 , doi =

  22. [22]

    and Beasley, John W

    Arndt, Brian G. and Beasley, John W. and Watkinson, Michelle D. and Temte, Jonathan L. and Tuan, Wen-Jan and Sinsky, Christine A. and Gilchrist, Valerie J. , journal =. Tethered to the. 2017 , doi =

  23. [23]

    Mayo Clinic Proceedings , volume =

    Relationship Between Clerical Burden and Characteristics of the Electronic Environment With Physician Burnout and Professional Satisfaction , author =. Mayo Clinic Proceedings , volume =. 2016 , doi =

  24. [24]

    and Zhang, Hao and Stoica, Ion , booktitle =

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with. 2023 , note =

  25. [25]

    2026 , eprint=

    HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks , author=. 2026 , eprint=

  26. [26]

    2026 , howpublished =

  27. [27]

    arXiv preprint arXiv:2507.20534 , year =

  28. [28]

    Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and others , journal =

  29. [29]

    Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others , journal =

  30. [30]

    arXiv preprint arXiv:2503.19786 , year =

  31. [31]

    arXiv preprint arXiv:2504.07491 , year =

  32. [32]

    2025 , howpublished =

  33. [33]

    Structured Distillation of Web Agent Capabilities Enables Generalization

    Structured Distillation of Web Agent Capabilities Enables Generalization , author =. arXiv preprint arXiv:2604.07776 , year =

  34. [34]

    Wang, Xinyuan and Wang, Bowen and Lu, Dunjie and Yang, Junlin and Xie, Tianbao and Wang, Junli and Deng, Jiaqi and Guo, Xiaole and Xu, Yiheng and Wu, Chen Henry and others , journal =