pith. sign in

arxiv: 2606.09447 · v1 · pith:JO7DSXVGnew · submitted 2026-06-08 · 💻 cs.AI

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Pith reviewed 2026-06-27 16:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords web agentsreinforcement learningcloud consolesdistillationdocumentation verificationGRPOAI training
0
0 comments X

The pith

A 32B open model trained by distillation and RL reaches 63.52 percent success on cloud console verification tasks, within 1.82 points of the best proprietary model at 92 percent lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cloud platforms require millions of annual checks to ensure documentation matches their rapidly changing consoles, but manual efforts cover less than one percent. Proprietary frontier models perform well on this but are too expensive and raise privacy concerns for large-scale use. The paper shows that distilling their trajectories into a smaller model and then applying reinforcement learning directly in real cloud environments, using a deterministic rollout system and objective rewards from audit logs, produces a 32B model that nearly matches the top proprietary performance. This opens the door to affordable, private automation of the verification process.

Core claim

Through a two-stage process of supervised fine-tuning on trajectories distilled from frontier models and subsequent reinforcement learning with Group Relative Policy Optimization using a dual-channel outcome reward model, the AliyunConsoleAgent-32B evolves to handle real cloud console tasks autonomously. Supported by Terraform-based resource provisioning for high-determinism rollouts and rule-based evaluation from backend audit logs, it achieves a 63.52% mean success rate on a 278-task benchmark. This represents a 20.24 percentage point gain over the base model and narrows the gap to the leading frontier model to 1.82 percentage points while cutting inference costs by 92%.

What carries the argument

The two-stage distillation followed by GRPO reinforcement learning in real cloud environments with rule-based rewards from audit logs and a high-determinism rollout system using Terraform pre-provisioning.

If this is right

  • Automated verification of cloud documentation can scale to the required millions of annual inspections.
  • The trained model acquires product-specific understanding beyond mechanical instruction following.
  • Real-world RL training becomes feasible without environment noise corrupting the signal.
  • Cost and privacy barriers to deploying web agents in enterprise cloud settings are substantially reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distillation and RL pipelines could adapt open models for other complex web-based enterprise systems.
  • The emphasis on backend audit logs for rewards may generalize to other domains where objective outcome signals are available.
  • Further scaling the number of cloud products in training could enhance the model's ability to handle feature iterations autonomously.

Load-bearing premise

The rule-based reward protocol from backend audit logs delivers objective, reward-hacking-resistant signals that stay unbiased across varied cloud products and UI states.

What would settle it

Running the AliyunConsoleAgent-32B independently on the 278-task benchmark and observing whether its success rate remains within the bootstrap 95% confidence interval of being only 1.82 points below the frontier model.

Figures

Figures reproduced from arXiv: 2606.09447 by Bojie Rong, Hanyu Wu, Leihao Pei, Linquan Jiang, Pengfei Kang, Qiaoping Wang, Yang Xu, Yawen Wei, Zheyu Shen, Zhi Zhao.

Figure 1
Figure 1. Figure 1: Per-task inference cost vs. pass@1 success rate. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: AliyunConsoleAgent training pipeline: two-stage SFT+GRPO training atop the high-determinism Rollout Environment, [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Execution trace for an RDS auto-renewal task. The GRPO model autonomously enables auto-renewal first to create [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Layered architecture of the Rollout environment: Account Pool, Sandbox Execution, Resource META Provisioning, [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. It proposes a two-stage training paradigm consisting of supervised fine-tuning on distilled frontier-model trajectories followed by reinforcement learning with Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model grounded in backend audit logs. A high-determinism rollout system using Terraform-based provisioning is introduced to support large-scale RL. On a 278-task benchmark where the best frontier model achieves 65.34%, the resulting 32B model reaches 63.52% mean success rate (20.24 pp improvement over base model), narrowing the gap to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) at 92% lower inference cost.

Significance. If the empirical results hold, the work would show that open 32B-scale models can be trained via distillation and RL to approach proprietary frontier performance on complex, product-specific web agent tasks in live cloud environments while achieving substantial cost reductions. The engineering contribution of the high-determinism rollout system for isolating environment noise is a practical strength for scalable agent training.

major comments (2)
  1. [Abstract] Abstract (rule-based reward protocol): The protocol is described as objective and reward-hacking-resistant because it uses backend audit logs rather than LLM-as-judge. However, the same protocol supplies the dual-channel outcome reward for GRPO training and the success-rate metric for the 278-task benchmark. No human validation study, inter-rater agreement check, or analysis of log incompleteness/ambiguity across diverse cloud products is reported, which directly affects the reliability of both the training signal and the headline 63.52% result.
  2. [Experiments] Experiments (implied by abstract results): The abstract supplies concrete success rates, a 20.24 pp improvement, and a bootstrap CI, yet supplies no details on task construction, data splits, ablation studies, or statistical controls for the 278-task benchmark. These omissions make it impossible to assess whether the reported narrowing of the gap to the frontier model is robust.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of evaluation reliability and experimental transparency that we address below. We propose targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract (rule-based reward protocol): The protocol is described as objective and reward-hacking-resistant because it uses backend audit logs rather than LLM-as-judge. However, the same protocol supplies the dual-channel outcome reward for GRPO training and the success-rate metric for the 278-task benchmark. No human validation study, inter-rater agreement check, or analysis of log incompleteness/ambiguity across diverse cloud products is reported, which directly affects the reliability of both the training signal and the headline 63.52% result.

    Authors: We agree that explicit validation of the rule-based protocol would increase confidence in both the RL training signal and the reported results. The audit logs record all API invocations and resource state changes with high fidelity as they originate from the cloud provider's production logging system. Nevertheless, to address potential edge cases such as partial log coverage for certain products, we will add a dedicated subsection in Section 4 (Experiments) that (i) quantifies log completeness across the 278 tasks, (ii) discusses known ambiguities, and (iii) reports a post-hoc human validation study on a random 50-task subset, including inter-rater agreement statistics between the rule-based judgments and two human annotators. These additions will be included in the revised version. revision: yes

  2. Referee: [Experiments] Experiments (implied by abstract results): The abstract supplies concrete success rates, a 20.24 pp improvement, and a bootstrap CI, yet supplies no details on task construction, data splits, ablation studies, or statistical controls for the 278-task benchmark. These omissions make it impossible to assess whether the reported narrowing of the gap to the frontier model is robust.

    Authors: The full manuscript contains Section 4 (Experiments) that describes benchmark construction (tasks derived from real documentation-verification tickets across 12 cloud products), the 70/30 train/test split used for SFT, and ablation studies isolating the contributions of SFT distillation and GRPO. The bootstrap CI is obtained via 1,000 resamples of the task set with replacement. We acknowledge that the abstract itself is too concise to convey these controls. In revision we will (i) expand the abstract with one additional sentence summarizing benchmark provenance and (ii) add per-product success-rate tables and variance estimates as supplementary statistical controls in Section 4.2. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results independent of training signal

full rationale

The paper reports measured success rates on a 278-task benchmark after SFT+GRPO training. The rule-based reward protocol supplies the RL training signal and is also used to compute the reported success rates, but this does not constitute circularity under the enumerated patterns: no equation reduces a derived quantity to a fitted parameter by construction, no self-citation chain justifies a uniqueness claim, and no ansatz or renaming is smuggled in. The headline numbers are direct empirical counts on held-out tasks rather than predictions forced by the training objective itself. The derivation chain (distillation followed by GRPO with dual-channel rewards) remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that cloud console environments can be made deterministic enough for stable RL signals; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Real cloud console environments can be provisioned with sufficient determinism using Terraform and LLM-driven methods to isolate training signals from noise.
    Presented as essential for large-scale RL training in the abstract.

pith-pipeline@v0.9.1-grok · 5866 in / 1241 out tokens · 34920 ms · 2026-06-27T16:20:03.281200+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. 2024. DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.arXiv preprint arXiv:2406.11896(2024)

  2. [2]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. InProceedings of ACL

  4. [4]

    DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948(2025)

  5. [5]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. In Advances in Neural Information Processing Systems

  6. [6]

    Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Gu, and Izzeddin Gur. 2024. Multimodal Web Navigation with Instruction-Finetuned Foundation Models. InInternational Conference on Learn- ing Representations

  7. [7]

    HashiCorp. 2024. Terraform: Infrastructure as Code. https://www.terraform.io

  8. [8]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models.Proceedings of ACL(2024)

  9. [9]

    Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, et al. 2024. The BrowserGym Ecosystem for Web Agent Research.arXiv preprint arXiv:2412.05467(2024)

  10. [10]

    Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, and Xing Sun. 2025. CUAReward- Bench: A Benchmark for Evaluating Reward Models on Computer-using Agent. arXiv:2510.18596 [cs.SE] https://arxiv.org/abs/2510.18596

  11. [11]

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, et al . 2024. AutoGLM: Autonomous Foundation Agents for GUIs.arXiv preprint arXiv:2411.00820(2024)

  12. [12]

    Xing Han Lu, Zdeněk Kasner, and Siva Reddy. 2024. WebLINX: Real-World Website Navigation with Multi-Turn Dialogue.Proceedings of ICML(2024)

  13. [13]

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. 2024. Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents.arXiv preprint arXiv:2408.07199(2024)

  14. [14]

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. 2025. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. arXiv:2411.02337 [cs.CL] https: //arxiv.org/abs/2411.02337

  15. [15]

    Yujia Qin et al . 2025. UI-TARS: Pioneering Automated GUI Interaction with Native Agents.arXiv preprint arXiv:2501.12326(2025)

  16. [16]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. InProceedings of NeurIPS

  17. [17]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

  18. [18]

    Proximal Policy Optimization Algorithms.arXiv preprint arXiv:1707.06347 (2017)

  19. [19]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300(2024)

  20. [20]

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang

  21. [21]

    In Proceedings of ICML

    World of Bits: An Open-Domain Platform for Web-Based Agents. In Proceedings of ICML

  22. [22]

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guo- hao Li, Junxian He, Yu Qiao, and Zhiyong Wu. 2024. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis.arXiv preprint arXiv:2412.19723(2024)

  23. [23]

    Zeyi Sun et al . 2025. SEAgent: Self-Evolving Computer Use Agent with Au- tonomous Learning from Experience.arXiv preprint arXiv:2508.04700(2025)

  24. [24]

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al . 2025. UI- TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning.arXiv preprint arXiv:2509.02544(2025)

  25. [25]

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.Advances in Neural Information Processing Systems37 (2024), 52040–52094

  26. [26]

    Yifan Xu, Xiao Liu, et al. 2025. MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents.arXiv preprint arXiv:2509.18119(2025)

  27. [27]

    Taofeng Xue, Chong Peng, et al. 2026. EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience.arXiv preprint arXiv:2601.15876 (2026)

  28. [28]

    Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Yu Qiao, Wenhai Wang, Xizhou Zhu, and Jifeng Dai. 2025. ZeroGUI: Automating Online GUI Learning at Zero Human Cost.arXiv preprint arXiv:2505.23762(2025)

  29. [29]

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao

  30. [30]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.arXiv preprint arXiv:2310.11441(2023)

  31. [31]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InProceedings of ICLR

  32. [32]

    Qiying Yu et al. 2025. DAPO: An Open-Source LLM Reinforcement Learning System at Scale.arXiv preprint arXiv:2503.14476(2025)

  33. [33]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT- Bench and Chatbot Arena. InProceedings of NeurIPS

  34. [34]

    Hanzhang Zhou, Xu Zhang, et al. 2025. MAI-UI Technical Report: Real-World Centric Foundation GUI Agents.arXiv preprint arXiv:2512.22047(2025)

  35. [35]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. InProceedings of ICLR. GenAI Usage Disclosure In accordance with the ACM Policy on the use of Generative AI, we disclose that generative AI tools ...