PRO-CUA: Process-Reward Optimization for Computer Use Agents

Han Zhao; Hao Bai; Rui Yang; Tong Zhang; Yifei He

arxiv: 2605.29119 · v1 · pith:PEQSFWR4new · submitted 2026-05-27 · 💻 cs.AI

Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

Pith reviewed 2026-06-29 11:44 UTC · model grok-4.3

classification 💻 cs.AI

keywords: computer use agents · process reward optimization · reinforcement learning · GUI interaction · step-level feedback · on-policy training · distribution shift

The pith

PRO-CUA enables training of computer use agents through step-level reinforcement learning using a process reward model on the agent's own live states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PRO-CUA to overcome limitations in training computer use agents that automate digital workflows. Current methods either rely on costly expert demonstrations that cause distribution shift or use sparse trajectory rewards that make credit assignment difficult. PRO-CUA has the policy collect states in live rollouts, generate candidate actions, score them with a process reward model, and optimize using group-relative advantages. This provides dense feedback without needing golden answers or offline trajectories. A sympathetic reader would care because it could make training scalable for complex GUI tasks on web benchmarks.

Core claim

The central claim is that decoupling on-policy environment interaction from policy optimization, combined with step-level feedback from a process reward model on diverse candidate actions, enables dense and flexible credit assignment while reducing distribution shift by training on the agent's own execution states.

What carries the argument

The process reward model (PRM) that supplies step-level quality signals for ranking actions in GUI states, used with group-relative advantages for optimization.
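
A minimal sketch of how group-relative advantages could be computed from PRM scores for the candidate actions sampled at one state. The text reviewed here gives no equations, so the normalization below (subtract the group mean, divide by the group standard deviation, as in common GRPO implementations) is an assumption, and `group_relative_advantages` and `eps` are illustrative names rather than the paper's.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize PRM rewards for candidate actions sampled at ONE state.

    Assumed form (not confirmed by the source):
        A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate actions at one GUI state; the PRM marked two as correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this assumed form, a group whose candidates all receive the same reward yields all-zero advantages, so such a state would contribute no learning signal.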

If this is right

Training does not depend on golden answers or offline expert trajectories.
Distribution shift is reduced because optimization uses the agent's own states from live rollouts.
Credit assignment becomes dense and flexible for long-horizon tasks.
The approach demonstrates effectiveness on live web benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the PRM generalizes well, the method could extend to other interactive agent domains like robotics or software testing.
Reliable step-level rewards might lower the infrastructure costs associated with long-horizon interactions.
This framework suggests a path to iterative improvement of agents without constant human supervision.

Load-bearing premise

The process reward model must supply reliable, generalizable step-level quality signals that correctly rank actions in live, previously unseen GUI states.

What would settle it

Observing that actions ranked highly by the PRM do not lead to better task completion rates than lower-ranked ones in new environments would falsify the central claim.
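
One way this check could be run, sketched under assumptions: each record pairs the PRM's verdict on an executed step with whether the episode containing it eventually completed the task. The record layout and the name `completion_rate_gap` are hypothetical, and comparing step verdicts with episode outcomes is only a coarse proxy for ranking quality.

```python
def completion_rate_gap(records):
    """records: list of (prm_says_correct: bool, task_completed: bool).

    Returns the completion rate among PRM-approved steps minus the rate
    among PRM-rejected steps. A gap near zero or below zero on new
    environments would count against the PRM's rankings.
    """
    approved = [done for ok, done in records if ok]
    rejected = [done for ok, done in records if not ok]
    if not approved or not rejected:
        raise ValueError("need both PRM-approved and PRM-rejected samples")

    def rate(outcomes):
        return sum(outcomes) / len(outcomes)

    return rate(approved) - rate(rejected)

# Hypothetical data, for illustration only.
records = [(True, True), (True, True), (True, False),
           (False, False), (False, True), (False, False)]
gap = completion_rate_gap(records)
```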

Figures

Figures reproduced from arXiv: 2605.29119 by Han Zhao, Hao Bai, Rui Yang, Tong Zhang, Yifei He.

**Figure 1.** Overview of the PRO-CUA pipeline. PRO-CUA alternates between two stages across multiple training iterations. In Stage 1, the current policy interacts with the live environment to collect on-policy states. In Stage 2, policy optimization is performed without further environment interaction through three steps: i) Step-level generation: the agent samples multiple candidate actions for each collected state; …

**Figure 2.** Process Reward Model (PRM) grading pipeline. The PRM receives a multimodal step context comprising the task instruction, the agent's action history, the proposed current action, and an annotated screenshot. For readability, the figure shows a zoomed-in crop, while the actual PRM input contains the full web interface. Based on this augmented context, the PRM generates a reasoning trace to assess whether th…

**Figure 3.** Step-level rewards assigned during training with moving average. GPT-5-mini assigns more conservative rewards, while Qwen3-VL-4B is more lenient on average. Despite this calibration gap, both PRMs achieve similar downstream policy performance, suggesting that GRPO is robust to differences in reward strictness through group normalization.

**Figure 4.** Data utilization across training iterations. PRO-CUA consistently yields more usable step-level training data than FBC and rule-based Step-RL because process rewards allow learning from both successful and failed finished trajectories, while the baselines rely on successful rollouts.
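
The Figure 3 caption attributes the similar downstream performance of a strict and a lenient PRM to group normalization. A small sketch of why that can hold, assuming the mean and standard deviation normalization commonly used with GRPO (the captions do not give the formula): a positive affine recalibration of the rewards within a group leaves the normalized advantages unchanged. The reward values and the function name `normalize` are hypothetical.

```python
from statistics import mean, pstdev

def normalize(rewards):
    """Assumed GRPO-style normalization: (r - mean) / std within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # no ranking signal in this group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical graded scores from a strict PRM for four candidate actions.
strict = [0.1, 0.3, 0.2, 0.6]
# A more lenient PRM, modeled here as a positive affine shift of the same scores.
lenient = [0.5 * r + 0.4 for r in strict]

adv_strict = normalize(strict)
adv_lenient = normalize(lenient)
max_diff = max(abs(a - b) for a, b in zip(adv_strict, adv_lenient))
```

The sketch covers calibration offsets only. If a lenient PRM flips individual verdicts rather than shifting scores, the within-group ranking changes and the advantages change with it.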

Original abstract

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
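
The two-stage structure described in the abstract can be sketched as follows. Everything here is a toy stand-in: `ToyEnv`, `toy_policy`, and `toy_prm` are hypothetical placeholders for the live web environment, the agent, and the process reward model, and the policy update itself is omitted. The point is only the data flow: stage 1 touches the environment, stage 2 does not.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a live web environment."""
    def reset(self):
        self.t = 0
        return f"state-{self.t}"

    def step(self, action):
        self.t += 1
        return f"state-{self.t}", self.t >= 3  # (next_state, done)

def toy_policy(state, rng):
    # Placeholder policy: picks an action type at random.
    return rng.choice(["click", "type", "scroll"])

def toy_prm(state, action):
    # Placeholder verdict; a real PRM would judge the annotated screenshot.
    return 1.0 if action == "click" else 0.0

def collect_states(env, policy, rng, episodes=2):
    """Stage 1: roll out the current policy and keep the visited states."""
    states = []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            states.append(state)
            state, done = env.step(policy(state, rng))
    return states

def build_training_groups(states, policy, prm, rng, k=4):
    """Stage 2: sample k candidates per state and score them, no env calls."""
    groups = []
    for s in states:
        actions = [policy(s, rng) for _ in range(k)]
        rewards = [prm(s, a) for a in actions]
        groups.append((s, actions, rewards))
    return groups

rng = random.Random(0)
states = collect_states(ToyEnv(), toy_policy, rng)
groups = build_training_groups(states, toy_policy, toy_prm, rng)
```

Each element of `groups` holds one state with its candidate actions and PRM rewards, which is the unit over which group-relative advantages would then be computed.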

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PRO-CUA sketches an on-policy PRM pipeline for step-level credit in CUAs but leaves the key generalization claim untested in the given text.

The main thing here is a pipeline that collects live rollouts from the current policy, generates candidate actions per state, scores them with a process reward model, and optimizes using group-relative advantages. This is framed as fixing imitation shift and sparse credit assignment without golden labels.

The paper does a clear job naming the practical bottlenecks for computer use agents: behavior cloning drifts from expert data, and trajectory RL gives weak signals over long GUI sequences. The decoupling of interaction from optimization and the on-policy focus are reasonable engineering choices that match the stated goals.

The soft spot is exactly the one in the stress-test note. Everything rests on the PRM supplying reliable rankings on states the policy actually produces in new environments. The abstract mentions using the PRM for feedback and claims experiments on live web benchmarks, but supplies no training data description, objective, or validation for the PRM itself. Without that, it is impossible to know whether the method removes the expert-data requirement or simply relocates it. No equations for advantage computation or experimental controls appear either, so the soundness score from the reader report looks accurate.

This is aimed at people working on scaling interactive agents rather than core theory. A reader already building CUA systems could pull the high-level design for discussion, but the version supplied is too thin for implementation or strong claims.

I would bring it to a reading group as a prompt for talking about PRM use in agents. I would not cite it yet. It deserves peer review if the full paper adds the missing PRM details and results, since the target problem is real and the proposed structure is coherent even if the evidence is still missing.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes PRO-CUA, a process-reward optimization framework for training computer use agents (CUAs). The method decouples live environment interaction from optimization: the current policy performs rollouts to collect states, generates multiple candidate actions per state, obtains step-level scores from a process reward model (PRM), and is updated via group-relative advantages. The central claims are that this yields dense, flexible credit assignment without golden answers or offline expert trajectories and reduces distribution shift by training exclusively on the agent's own execution states. Experiments on live web benchmarks are reported to confirm effectiveness and PRM reliability.

Significance. If the PRM supplies accurate, generalizable step-level rankings on previously unseen GUI states generated by the current policy, the framework would address key bottlenecks in CUA training—imitation bottlenecks, sparse rewards, and distribution shift—while lowering the cost of high-quality supervision. The on-policy, PRM-guided design with group-relative advantages is a concrete technical contribution that could be adopted more broadly if the generalization assumption holds.

major comments (2)

[Abstract / Method] Abstract and method description: the claim that PRO-CUA trains 'without relying on golden answers or offline expert trajectories' cannot be evaluated because no information is supplied on the PRM's training data, objective, or validation set. If PRM training itself requires expert step labels or golden trajectories, the stated benefit is not realized and the distribution-shift reduction is only partial.
[Abstract] Abstract: the assertion that the PRM supplies 'reliable' step-level feedback on live rollouts rests on an untested generalization assumption. No ablation, hold-out evaluation on policy-generated states, or comparison against expert-labeled baselines is described, making it impossible to determine whether credit assignment is actually dense and correct or merely noisy.
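
The hold-out evaluation requested in the second comment could be as simple as measuring agreement between PRM verdicts and expert labels on states sampled from the current policy. A sketch under assumptions: both inputs are boolean lists of equal length, and the function name and choice of metrics are illustrative, not taken from the paper.

```python
def prm_agreement(prm_verdicts, expert_labels):
    """Accuracy and precision of PRM step verdicts against expert labels."""
    if len(prm_verdicts) != len(expert_labels) or not expert_labels:
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(prm_verdicts, expert_labels))
    accuracy = sum(p == e for p, e in pairs) / len(pairs)
    # Expert labels for the steps the PRM approved.
    approved = [e for p, e in pairs if p]
    # Precision: of the steps the PRM approved, how many did experts approve?
    precision = sum(approved) / len(approved) if approved else None
    return accuracy, precision

# Hypothetical hold-out set of five policy-generated steps.
accuracy, precision = prm_agreement(
    [True, True, False, False, True],
    [True, False, False, True, True],
)
```

Precision matters here because a falsely approved step receives a positive reward, which pushes the policy toward a wrong action.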

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our claims regarding PRM training and generalization. We address the major comments point-by-point below.

Referee: [Abstract / Method] Abstract and method description: the claim that PRO-CUA trains 'without relying on golden answers or offline expert trajectories' cannot be evaluated because no information is supplied on the PRM's training data, objective, or validation set. If PRM training itself requires expert step labels or golden trajectories, the stated benefit is not realized and the distribution-shift reduction is only partial.

Authors: The core claim in the abstract refers to the PRO-CUA optimization loop itself: the policy collects its own on-policy states, samples candidate actions, and is updated using only PRM scores via group-relative advantages, without access to golden answers or expert trajectories at optimization time. This is distinct from any offline data used to train the PRM in a separate stage. We agree that the abstract and method section lack sufficient detail on the PRM's training data, objective, and validation, which prevents full evaluation of the claim. We will revise the manuscript to add a dedicated subsection describing the PRM training procedure, data sources, and how the on-policy phase achieves the stated benefits.

revision: yes

Referee: [Abstract] Abstract: the assertion that the PRM supplies 'reliable' step-level feedback on live rollouts rests on an untested generalization assumption. No ablation, hold-out evaluation on policy-generated states, or comparison against expert-labeled baselines is described, making it impossible to determine whether credit assignment is actually dense and correct or merely noisy.

Authors: The experiments on live web benchmarks include quantitative results demonstrating PRO-CUA effectiveness and supporting PRM reliability through end-to-end performance gains. However, we acknowledge that the manuscript does not present explicit ablations, hold-out evaluations specifically on policy-generated states, or direct comparisons against expert-labeled baselines for the PRM. We will add these analyses in the revision, including hold-out tests on states sampled from the current policy and any feasible comparisons to expert annotations.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation treats PRM as external and makes no self-referential reductions

The paper claims PRO-CUA achieves dense credit assignment and reduced distribution shift by collecting states via live rollouts from the current policy, generating candidate actions, and optimizing with group-relative advantages from PRM feedback. No equations, fitted parameters, or predictions are shown that reduce these benefits to the inputs by construction. The PRM is presented as an external component supplying step-level signals, with no description of its training that would create a self-definitional loop or fitted-input-called-prediction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is smuggled in. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that a trainable PRM can be obtained and will generalize.

pith-pipeline@v0.9.1-grok · 5725 in / 887 out tokens · 26968 ms · 2026-06-29T11:44:02.246267+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages · 1 internal anchor

[1]

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579. Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe

[2]

In International Conference on Learning Representations, volume 2024, pages 39578–39601

Let's verify step by step. In International Conference on Learning Representations, volume 2024, pages 39578–39601. Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, and 1 others. 2025. Cuarewardbench: A benchmark for evaluating reward models on computer-using agent. arXiv preprint arXiv:2510...

[3]

In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–

A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–

[4]

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar

JMLR Workshop and Conference Proceedings. Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. 2025. Rewarding progress: Scaling automated process verifiers for llm reasoning. In International Conference on Learning Representations, volume 2025, pages 60808–60838. Jun...

[5]

Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439. Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, and 1 others. 202...

[6]

Gui-pra: Process reward agent for gui tasks

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094. Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, and Shengyu Zhang. 2025. Gui-pra: Process reward agent for gui tasks. arXiv preprint arXiv:2509.23263. Yih...

[7]

Yaowei Zheng, Richong Zhang, Junhao Zhang, YeYanhan YeYanhan, and Zheyan Luo

Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623. Yaowei Zheng, Richong Zhang, Junhao Zhang, YeYanhan YeYanhan, and Zheyan Luo. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computationa...

[8]

The overarching task instruction

[9]

The history of actions taken so far

[10]

The CURRENT screenshot (the state immediately BEFORE the proposed action), annotated to show the proposed target of the action

[11]

The proposed Thought and Action Code. The screenshot is an annotated visualization of the proposed action, not a raw screenshot: - Red marks, arrows, or points indicate where the proposed action is targeting. - Small index labels and overlay text are part of the annotation. - Use these annotations to judge whether the proposed action is correctly grounded...

[12]

[Current State Assessment]: What is currently visible on the screen? What is the immediate blocker to completing the task?

[13]

[Target Verification]: Does the proposed code correctly and accurately target the intended UI element in the screenshot?

[14]

[Mental Rollout]: If this exact code is executed, what will happen? (e.g., "A dropdown menu will appear," "The page will scroll down," "The text 'shoes' will be typed")

[15]

[Task Alignment]: Does this predicted outcome meaningfully and efficiently advance the task? Or is it a redundant/wasteful action given the history?

[16]

[Final Verdict]: Conclude whether the step is Correct or Incorrect. </analysis_process> json "is_correct": boolean, "reflection": "A 1-2 sentence summary of why the action was marked correct or incorrect."
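
The prompt fragments above end with a JSON verdict carrying `is_correct` and `reflection`. A sketch of how such a reply might be turned into a scalar step reward, assuming the reply contains a single JSON object after the analysis text and that a correct step maps to 1.0 and anything else to 0.0. Both the parsing strategy and the reward mapping are assumptions, and `verdict_to_reward` is a hypothetical helper.

```python
import json

def verdict_to_reward(prm_reply):
    """Extract the JSON verdict from a PRM reply and map it to a reward.

    Assumed mapping: is_correct == True -> 1.0, otherwise 0.0.
    Unparseable replies return None so the caller can drop the sample.
    """
    # Take the outermost brace-delimited span as the candidate JSON object.
    start, end = prm_reply.find("{"), prm_reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(prm_reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    return 1.0 if verdict.get("is_correct") is True else 0.0

reply = (
    "[Final Verdict]: Correct.\n"
    '{"is_correct": true, "reflection": "The click targets the search box."}'
)
reward = verdict_to_reward(reply)
```

Returning `None` for malformed replies, rather than defaulting to 0.0, avoids penalizing an action because of a formatting failure in the judge.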
