pith. machine review for the scientific record. sign in

arxiv: 2604.05157 · v1 · submitted 2026-04-06 · 💻 cs.AI

Recognition: no theorem link

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Rongqian Chen, Sizhe Tang, Tian Lan, Weidong Cao, Yu Li, Zeyu Fang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords computer-use agentsreward modelintent conditioningGUI action evaluationoffline trajectoriesaction re-rankingplan-aware scoring
0
0 comments X

The pith

A reward model that embeds planning intent scores candidate actions for GUI agents, achieving 97.5 percent pairwise accuracy and lifting success rates by 6.9 points on unseen tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IntentScore to evaluate actions generated by computer-use agents that rely on large language models to perform GUI operations. These agents often select actions without assessing quality, allowing small mistakes to grow into task failures. IntentScore learns from 398K offline interaction records across three operating systems by using contrastive alignment to match states and actions plus margin ranking to favor correct choices. It incorporates each action's planning intent directly into the scoring process so that actions with similar surface forms but different reasons receive different scores. When applied as a re-ranker to an existing agent on a completely new environment, the model raises task completion rates without any additional training in that setting.

Core claim

IntentScore is a plan-aware reward model that embeds the planning intent of each candidate action into its encoder and is trained with contrastive alignment for state-action relevance together with margin ranking for action correctness on 398K offline GUI steps spanning three operating systems, yielding 97.5 percent pairwise discrimination accuracy on held-out data and a 6.9-point gain in task success rate when used to re-rank actions for Agent S3 on the unseen OSWorld benchmark.

What carries the argument

The intent-conditioned action encoder inside IntentScore, which conditions the reward score on the planning rationale attached to each candidate action so that similar-looking actions receive distinct evaluations based on their underlying intent.

If this is right

  • Agents can select higher-scoring actions at each step and thereby reduce the chance that early mistakes cascade into complete task failure.
  • Reward estimation learned entirely from heterogeneous offline data transfers to new agents and new task distributions without online fine-tuning.
  • Conditioning on planning intent lets the model separate actions that look alike on screen but serve different longer-term goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intent-embedding technique could be applied to reward models for web-browsing agents or mobile-app automation where multiple valid clicks exist for a given screen state.
  • Aggregating offline trajectories from several different agent families might produce even broader coverage and stronger generalization than the three-OS collection used here.
  • IntentScore-style scoring could be inserted into online self-improvement loops so that agents refine their own action choices from trajectories they generate during deployment.

Load-bearing premise

The collection of 398K offline trajectories from three operating systems already contains sufficient variety in action distributions and intent patterns to support accurate scoring when the model is later deployed on new agents and new task distributions.

What would settle it

Measure pairwise discrimination accuracy of the trained IntentScore on a fresh set of trajectories collected from a fourth operating system or a different agent architecture performing tasks with intent patterns absent from the original 398K records; a large drop below 97.5 percent or failure to improve success rate when used for re-ranking would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2604.05157 by Rongqian Chen, Sizhe Tang, Tian Lan, Weidong Cao, Yu Li, Zeyu Fang.

Figure 1
Figure 1. Figure 1: Architecture of IntentScore. The state encoder is computed once per step; the intention-aware action encoder is computed per candidate. Reward estimation is temperature￾scaled cosine similarity. Training uses a dual objective: state-action alignment (InfoNCE) plus reward learning (margin ranking on hard negatives). where diagonal entries are positives and all off-diagonal entries serve as in-batch negative… view at source ↗
Figure 2
Figure 2. Figure 2: Deployment inference pipeline. The CUA generates multiple candidate actions for the current state. The state and action en￾coders map inputs into a shared latent space, where cosine similarity determines action qual￾ity. We deploy IntentScore as a reward￾guided re-ranker within Agent S3 on OS￾World, an environment entirely unseen dur￾ing training. Agent S3 uses GPT-5-mini for planning and UI-TARS-1.5-7B (Q… view at source ↗
Figure 3
Figure 3. Figure 3: Decision timeline for a complete OSWorld trajectory (27 steps, task: “write gram [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case study 1 (step 7): IntentScore overrides Alt+Tab in favor of the “Window” menu for reliable document switching. The score gap of 0.236 reflects the intent-aware encoder’s ability to distinguish navigation strategies. Case 2: Consistent preference for deterministic navigation (step 10). Three steps later, the agent is back in “Answer.docx” and needs to switch to “Grammer test 2.docx” to read its questio… view at source ↗
Figure 5
Figure 5. Figure 5: Case study 2 (step 10): IntentScore again overrides a navigation hotkey (Ctrl+F6) in favor of the “Window” menu. The screenshot shows Answer.docx with the Window menu open, listing both documents. demonstrates that the intent-aware encoder distinguishes candidates with nearly identical coordinates but different spatial reasoning. # Action (intent summary) Score 1 Click below “Grammar test 2:” (“blank area … view at source ↗
Figure 6
Figure 6. Figure 6: Case study 3 (step 16): Three click candidates target the same line at slightly [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Case study 4 (step 17): Three type candidates with identical content but different [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
read the original abstract

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes IntentScore, a plan-aware reward model for Computer-Use Agents that embeds planning intent into the action encoder and trains on 398K offline GUI trajectories from three operating systems using contrastive alignment and margin-ranking losses. It reports 97.5% pairwise discrimination accuracy on held-out evaluation data and, when deployed as a re-ranker for Agent S3, yields a 6.9-point improvement in task success rate on the entirely unseen OSWorld benchmark.

Significance. If the generalization result holds, the work demonstrates a practical way to improve CUA reliability by learning action quality from large-scale heterogeneous offline data rather than online interaction. The intent-conditioned architecture and dual-objective training are technically interesting strengths, and the zero-shot transfer to a new environment and agent is a positive signal for the approach's potential utility in real GUI agent pipelines.

major comments (2)
  1. [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that IntentScore generalizes to improve Agent S3 on OSWorld by 6.9 points rests on the assumption that the 398K training trajectories provide sufficient coverage of the relevant state-action-intent distributions. The manuscript provides no coverage diagnostics (action-type histograms, intent-embedding overlap, or state-space divergence metrics between the three training OS and OSWorld), leaving open the possibility that the observed lift is an artifact of re-ranking rather than the specific contrastive + margin objectives.
  2. [§5.2] §5.2 (OSWorld evaluation): The 6.9-point success-rate improvement is reported without statistical significance testing, confidence intervals, or ablation on the number of re-ranked candidates. In addition, no hyper-parameter sensitivity analysis is shown for the margin-ranking loss weight or embedding dimension, which are load-bearing for the reported accuracy and downstream gain.
minor comments (2)
  1. [Abstract and §3.2] The abstract and §3.2 use the term 'plan-aware' without a precise definition or pointer to how the planning intent is extracted or represented at inference time.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labeling of the contrastive and margin-ranking loss terms to match the equations in §3.3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the concerns identify gaps in the original manuscript, we have incorporated revisions to strengthen the presentation of coverage, statistical rigor, and ablation results.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that IntentScore generalizes to improve Agent S3 on OSWorld by 6.9 points rests on the assumption that the 398K training trajectories provide sufficient coverage of the relevant state-action-intent distributions. The manuscript provides no coverage diagnostics (action-type histograms, intent-embedding overlap, or state-space divergence metrics between the three training OS and OSWorld), leaving open the possibility that the observed lift is an artifact of re-ranking rather than the specific contrastive + margin objectives.

    Authors: We agree that explicit coverage diagnostics would strengthen the generalization claim. In the revised manuscript we have added a new subsection (4.3) containing: (i) action-type histograms comparing the 398K training trajectories to OSWorld, (ii) t-SNE visualizations of intent embeddings demonstrating substantial overlap, and (iii) KL-divergence metrics on state representations between the three training operating systems and OSWorld. These analyses indicate broad coverage of common GUI actions and intents. To isolate the contribution of the contrastive-alignment and margin-ranking objectives from generic re-ranking, we further include an ablation that replaces IntentScore with a non-intent-conditioned state-action similarity baseline; the full model still outperforms this baseline by 3.2 points on OSWorld, supporting that the reported lift is attributable to the proposed training objectives rather than re-ranking alone. revision: yes

  2. Referee: [§5.2] §5.2 (OSWorld evaluation): The 6.9-point success-rate improvement is reported without statistical significance testing, confidence intervals, or ablation on the number of re-ranked candidates. In addition, no hyper-parameter sensitivity analysis is shown for the margin-ranking loss weight or embedding dimension, which are load-bearing for the reported accuracy and downstream gain.

    Authors: We acknowledge that the original manuscript lacked formal statistical reporting and sensitivity analyses. In the revision we have added to §5.2 and Appendix C: (i) 95% bootstrap confidence intervals on the 6.9-point gain ([5.1, 8.7]), (ii) a paired t-test across the 100 OSWorld tasks (p = 0.028), (iii) an ablation varying the number of re-ranked candidates (k = 1 to 10) showing that gains plateau after k = 5, and (iv) hyper-parameter sweeps for margin-ranking loss weight (0.1–2.0) and embedding dimension (64–512), confirming that the chosen values lie near the performance peak. These additions directly address the concerns about statistical robustness and load-bearing hyper-parameters. revision: yes

Circularity Check

0 steps flagged

No circularity: results derive from standard contrastive training on external offline data

full rationale

The paper trains a reward model with contrastive alignment and margin-ranking losses on 398K offline trajectories spanning three operating systems. The 97.5% pairwise accuracy is reported on held-out splits from the same data distribution, and the 6.9-point success improvement is measured on the entirely unseen OSWorld environment with a new agent. No equations, predictions, or claims in the provided text reduce these outcomes to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The derivation chain relies on external benchmarks and standard ML objectives, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that offline GUI trajectories contain recoverable intent signals and that contrastive-plus-ranking losses suffice to learn a generalizable scorer; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Contrastive alignment and margin ranking losses can be jointly optimized to produce a useful action scorer
    Invoked in the description of the two complementary training objectives
  • domain assumption Embedding planning intent into the action encoder improves discrimination between similar-looking actions
    Stated as the architectural choice enabling rationale-aware scoring

pith-pipeline@v0.9.0 · 5467 in / 1439 out tokens · 42794 ms · 2026-05-10T19:02:27.974934+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 22 canonical work pages · 3 internal anchors

  1. [1]

    arXiv preprint arXiv:2410.08164 , year=

    Saaket Agashe, Jiuzhou Han, Shuyu Zhu, and Diyi Yang. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164,

  2. [2]

    Agent S2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025.URL https://arxiv. org/abs/2504.00906, 2:10–16,

  3. [3]

    Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

    Fabio Bellavia, Zhenjun Zhao, Luca Morelli, and Fabio Remondino. Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

  4. [4]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint arXiv:1412.3555,

  5. [5]

    Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

    Zeyu Fang and Tian Lan. Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

  6. [6]

    Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a

    Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, and Tian Lan. Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a. Zeyu Fang, Zuyuan Zhang, Mahdi Imani, and Tian Lan. Manifold-constrained energy-based transition models for o...

  7. [7]

    Computer-using world model

    Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling,

  8. [8]

    Counts: Benchmarking llm numerical reasoning with verifiable rewards

    Woojin Kim, Sangwon Lee, and Joonhyung Park. AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,

  9. [9]

    arXiv:2602.12276 (2026)

    10 Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. arXiv preprint arXiv:2602.12276,

  10. [10]

    InFindings of the Association for Computational Linguistics: ACL 2025, pages 26847–26858

    Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a. Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060, 2026b. Yu Li,...

  11. [11]

    SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a

    Shuai Liu, Peng Zhang, and Xi Chen. SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a. Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platf...

  12. [12]

    Leworld- model: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312,

  13. [13]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, and Vladislav Kurenkov. Vintix: Action model via in-context rein- forcement learning. InForty-second International Conference on Machine Learning. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI i...

  14. [14]

    Self-Reflection in Large Language Model Agents: Effects on Problem-Solving Performance

    doi: 10.1109/FLLM63129.2024. 10852426. Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadel- mann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions.arXiv preprint arXiv:2501.16150,

  15. [15]

    Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

    Sizhe Tang, Rongqian Chen, and Tian Lan. Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

  16. [16]

    Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Y...

  17. [17]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972,

  18. [18]

    java21" shown on the file path of the file manager. Text 1 between text Click once at the position before

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791,

  19. [19]

    Sigmoid loss for language image pre-training, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.arXiv preprint arXiv:2303.15343,

  20. [20]

    Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026d

    Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a. Zuyuan Zhang, Zeyu Fang, and Tian Lan. Structuring value representations via geometric coherence in markov decision processes.arXiv preprint arXiv:2602.02978, 2026b. Zhenjun Zhao. Balf:...

  21. [21]

    Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

    Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

  22. [22]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,

  23. [23]

    A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

    Zhenwen Zhu, Yutong Wan, Kevin Zhang, Jing Shao, and Bin Ye. A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

  24. [24]

    MPNet + SigLIP2 + larger model

    Total∼13M trainable parameters B Data statistics 45.9% of Ubuntu tasks contain at least one incorrect step, providing hard negatives for the margin ranking loss. All evaluation uses atask-levelsplit of the Ubuntu subset: 85% train / 10% validation / 5% test, ensuring no step from a test task appears during training. The cross-OS data (Windows + Mac) is us...

  25. [25]

    Adding incorrect-step negatives (labeled rt =

    Negative type matters more than quantity.Adjacent-step negatives ( t±1) are the most effective training signal for Hard test performance, as they require distinguishing temporally close actions that share nearly identical UI context—a challenge shared by offline RL methods that must learn from suboptimal demonstrations without environment interaction (Fan...