Recognition: no theorem link
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3
The pith
A reward model that embeds planning intent scores candidate actions for GUI agents, achieving 97.5 percent pairwise accuracy and lifting success rates by 6.9 points on unseen tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IntentScore is a plan-aware reward model that embeds the planning intent of each candidate action into its encoder and is trained with contrastive alignment for state-action relevance together with margin ranking for action correctness on 398K offline GUI steps spanning three operating systems, yielding 97.5 percent pairwise discrimination accuracy on held-out data and a 6.9-point gain in task success rate when used to re-rank actions for Agent S3 on the unseen OSWorld benchmark.
What carries the argument
The intent-conditioned action encoder inside IntentScore, which conditions the reward score on the planning rationale attached to each candidate action so that similar-looking actions receive distinct evaluations based on their underlying intent.
If this is right
- Agents can select higher-scoring actions at each step and thereby reduce the chance that early mistakes cascade into complete task failure.
- Reward estimation learned entirely from heterogeneous offline data transfers to new agents and new task distributions without online fine-tuning.
- Conditioning on planning intent lets the model separate actions that look alike on screen but serve different longer-term goals.
Where Pith is reading between the lines
- The same intent-embedding technique could be applied to reward models for web-browsing agents or mobile-app automation where multiple valid clicks exist for a given screen state.
- Aggregating offline trajectories from several different agent families might produce even broader coverage and stronger generalization than the three-OS collection used here.
- IntentScore-style scoring could be inserted into online self-improvement loops so that agents refine their own action choices from trajectories they generate during deployment.
Load-bearing premise
The collection of 398K offline trajectories from three operating systems already contains sufficient variety in action distributions and intent patterns to support accurate scoring when the model is later deployed on new agents and new task distributions.
What would settle it
Measure pairwise discrimination accuracy of the trained IntentScore on a fresh set of trajectories collected from a fourth operating system or a different agent architecture performing tasks with intent patterns absent from the original 398K records; a large drop below 97.5 percent or failure to improve success rate when used for re-ranking would falsify the generalization claim.
Figures
read the original abstract
Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IntentScore, a plan-aware reward model for Computer-Use Agents that embeds planning intent into the action encoder and trains on 398K offline GUI trajectories from three operating systems using contrastive alignment and margin-ranking losses. It reports 97.5% pairwise discrimination accuracy on held-out evaluation data and, when deployed as a re-ranker for Agent S3, yields a 6.9-point improvement in task success rate on the entirely unseen OSWorld benchmark.
Significance. If the generalization result holds, the work demonstrates a practical way to improve CUA reliability by learning action quality from large-scale heterogeneous offline data rather than online interaction. The intent-conditioned architecture and dual-objective training are technically interesting strengths, and the zero-shot transfer to a new environment and agent is a positive signal for the approach's potential utility in real GUI agent pipelines.
major comments (2)
- [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that IntentScore generalizes to improve Agent S3 on OSWorld by 6.9 points rests on the assumption that the 398K training trajectories provide sufficient coverage of the relevant state-action-intent distributions. The manuscript provides no coverage diagnostics (action-type histograms, intent-embedding overlap, or state-space divergence metrics between the three training OS and OSWorld), leaving open the possibility that the observed lift is an artifact of re-ranking rather than the specific contrastive + margin objectives.
- [§5.2] §5.2 (OSWorld evaluation): The 6.9-point success-rate improvement is reported without statistical significance testing, confidence intervals, or ablation on the number of re-ranked candidates. In addition, no hyper-parameter sensitivity analysis is shown for the margin-ranking loss weight or embedding dimension, which are load-bearing for the reported accuracy and downstream gain.
minor comments (2)
- [Abstract and §3.2] The abstract and §3.2 use the term 'plan-aware' without a precise definition or pointer to how the planning intent is extracted or represented at inference time.
- [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labeling of the contrastive and margin-ranking loss terms to match the equations in §3.3.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the concerns identify gaps in the original manuscript, we have incorporated revisions to strengthen the presentation of coverage, statistical rigor, and ablation results.
read point-by-point responses
-
Referee: [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that IntentScore generalizes to improve Agent S3 on OSWorld by 6.9 points rests on the assumption that the 398K training trajectories provide sufficient coverage of the relevant state-action-intent distributions. The manuscript provides no coverage diagnostics (action-type histograms, intent-embedding overlap, or state-space divergence metrics between the three training OS and OSWorld), leaving open the possibility that the observed lift is an artifact of re-ranking rather than the specific contrastive + margin objectives.
Authors: We agree that explicit coverage diagnostics would strengthen the generalization claim. In the revised manuscript we have added a new subsection (4.3) containing: (i) action-type histograms comparing the 398K training trajectories to OSWorld, (ii) t-SNE visualizations of intent embeddings demonstrating substantial overlap, and (iii) KL-divergence metrics on state representations between the three training operating systems and OSWorld. These analyses indicate broad coverage of common GUI actions and intents. To isolate the contribution of the contrastive-alignment and margin-ranking objectives from generic re-ranking, we further include an ablation that replaces IntentScore with a non-intent-conditioned state-action similarity baseline; the full model still outperforms this baseline by 3.2 points on OSWorld, supporting that the reported lift is attributable to the proposed training objectives rather than re-ranking alone. revision: yes
-
Referee: [§5.2] §5.2 (OSWorld evaluation): The 6.9-point success-rate improvement is reported without statistical significance testing, confidence intervals, or ablation on the number of re-ranked candidates. In addition, no hyper-parameter sensitivity analysis is shown for the margin-ranking loss weight or embedding dimension, which are load-bearing for the reported accuracy and downstream gain.
Authors: We acknowledge that the original manuscript lacked formal statistical reporting and sensitivity analyses. In the revision we have added to §5.2 and Appendix C: (i) 95% bootstrap confidence intervals on the 6.9-point gain ([5.1, 8.7]), (ii) a paired t-test across the 100 OSWorld tasks (p = 0.028), (iii) an ablation varying the number of re-ranked candidates (k = 1 to 10) showing that gains plateau after k = 5, and (iv) hyper-parameter sweeps for margin-ranking loss weight (0.1–2.0) and embedding dimension (64–512), confirming that the chosen values lie near the performance peak. These additions directly address the concerns about statistical robustness and load-bearing hyper-parameters. revision: yes
Circularity Check
No circularity: results derive from standard contrastive training on external offline data
full rationale
The paper trains a reward model with contrastive alignment and margin-ranking losses on 398K offline trajectories spanning three operating systems. The 97.5% pairwise accuracy is reported on held-out splits from the same data distribution, and the 6.9-point success improvement is measured on the entirely unseen OSWorld environment with a new agent. No equations, predictions, or claims in the provided text reduce these outcomes to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The derivation chain relies on external benchmarks and standard ML objectives, remaining self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Contrastive alignment and margin ranking losses can be jointly optimized to produce a useful action scorer
- domain assumption Embedding planning intent into the action encoder improves discrimination between similar-looking actions
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2410.08164 , year=
Saaket Agashe, Jiuzhou Han, Shuyu Zhu, and Diyi Yang. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164,
-
[2]
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025.URL https://arxiv. org/abs/2504.00906, 2:10–16,
-
[3]
Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,
Fabio Bellavia, Zhenjun Zhao, Luca Morelli, and Fabio Remondino. Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,
-
[4]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint arXiv:1412.3555,
work page internal anchor Pith review arXiv
-
[5]
Zeyu Fang and Tian Lan. Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,
-
[6]
Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, and Tian Lan. Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a. Zeyu Fang, Zuyuan Zhang, Mahdi Imani, and Tian Lan. Manifold-constrained energy-based transition models for o...
-
[7]
Computer-using world model
Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling,
2026
-
[8]
Counts: Benchmarking llm numerical reasoning with verifiable rewards
Woojin Kim, Sangwon Lee, and Joonhyung Park. AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,
-
[9]
10 Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. arXiv preprint arXiv:2602.12276,
-
[10]
InFindings of the Association for Computational Linguistics: ACL 2025, pages 26847–26858
Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a. Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060, 2026b. Yu Li,...
-
[11]
Shuai Liu, Peng Zhang, and Xi Chen. SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a. Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platf...
-
[12]
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312,
-
[13]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, and Vladislav Kurenkov. Vintix: Action model via in-context rein- forcement learning. InForty-second International Conference on Machine Learning. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI i...
-
[14]
Self-Reflection in Large Language Model Agents: Effects on Problem-Solving Performance
doi: 10.1109/FLLM63129.2024. 10852426. Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadel- mann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions.arXiv preprint arXiv:2501.16150,
-
[15]
Sizhe Tang, Rongqian Chen, and Tian Lan. Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,
-
[16]
Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Y...
-
[17]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972,
work page internal anchor Pith review arXiv
-
[18]
Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791,
-
[19]
Sigmoid loss for language image pre-training, 2023
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.arXiv preprint arXiv:2303.15343,
-
[20]
Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a. Zuyuan Zhang, Zeyu Fang, and Tian Lan. Structuring value representations via geometric coherence in markov decision processes.arXiv preprint arXiv:2602.02978, 2026b. Zhenjun Zhao. Balf:...
-
[21]
Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,
Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,
-
[22]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,
work page internal anchor Pith review arXiv
-
[23]
A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,
Zhenwen Zhu, Yutong Wan, Kevin Zhang, Jing Shao, and Bin Ye. A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,
-
[24]
MPNet + SigLIP2 + larger model
Total∼13M trainable parameters B Data statistics 45.9% of Ubuntu tasks contain at least one incorrect step, providing hard negatives for the margin ranking loss. All evaluation uses atask-levelsplit of the Ubuntu subset: 85% train / 10% validation / 5% test, ensuring no step from a test task appears during training. The cross-OS data (Windows + Mac) is us...
1920
-
[25]
Adding incorrect-step negatives (labeled rt =
Negative type matters more than quantity.Adjacent-step negatives ( t±1) are the most effective training signal for Hard test performance, as they require distinguishing temporally close actions that share nearly identical UI context—a challenge shared by offline RL methods that must learn from suboptimal demonstrations without environment interaction (Fan...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.