OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Pith reviewed 2026-06-28 15:44 UTC · model grok-4.3
The pith
Online multi-turn RL on live websites trains a 4B visual web agent to 67.0% success on Online-Mind2Web with only 0.4K initialization trajectories and 2.2K RL tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenWebRL is an open framework for training visual web agents via online multi-turn RL on real websites. It supplies the full pipeline: scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Trained with only 0.4K initialization trajectories and 2.2K open-ended RL tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, establishing new open-source state-of-the-art results while remaining competitive with proprietary systems such as OpenAI CUA and Gemini CUA. The work also examines the design choices that enable effective RL and analyzes how RL improves agentic reasoning.
What carries the argument
The OpenWebRL framework carries the argument: its live-browser infrastructure and trajectory-level success judging supply the reward signals that support stable multi-turn policy optimization on changing websites.
If this is right
- Visual web agents can be trained scalably without collecting large curated demonstration datasets.
- Online RL directly on live sites improves agentic reasoning beyond what supervised post-training alone achieves.
- Modest numbers of open-ended tasks suffice for effective multi-turn optimization when paired with trajectory-level rewards.
- Open-source agents can reach performance levels competitive with proprietary systems through this training route.
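The third point above, that a single trajectory-level reward can drive multi-turn optimization, can be illustrated with a minimal sketch. It assumes a group-relative baseline: several rollouts of the same task are each scored 0 or 1 by the judge, the scores are normalized within the group, and the resulting advantage is broadcast to every turn. This summary does not state which optimizer OpenWebRL uses, so the normalization scheme and the names `group_advantages` and `broadcast_to_turns` are illustrative assumptions rather than the authors' method.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize trajectory-level rewards within one group of rollouts.

    `rewards` holds one judge score per rollout of the same task
    (1.0 = success, 0.0 = failure). Hypothetical helper, not from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantage, num_turns):
    """Give every turn of a trajectory the same trajectory-level advantage."""
    return [advantage] * num_turns


# Four rollouts of one task: two judged successful, two judged failed.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [5, 8, 3, 6]  # number of agent turns in each rollout (made up)
advs = group_advantages(rewards)
per_turn = [broadcast_to_turns(a, n) for a, n in zip(advs, turns)]
```

Under this scheme, a group in which every rollout receives the same score yields all-zero advantages and contributes no learning signal, which is one reason reward noise and task difficulty would matter for data efficiency.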
Where Pith is reading between the lines
- The same live-environment RL loop could be adapted to train agents for other interactive interfaces such as mobile apps.
- The reported data efficiency suggests the method could lower the compute and annotation cost of building new web agents in resource-constrained settings.
- Further experiments that vary the judging granularity might reveal whether finer-grained rewards would accelerate learning on complex sites.
Load-bearing premise
Trajectory-level success judging on live browsers supplies reward signals with low enough noise to support stable multi-turn policy optimization on dynamic real-world websites.
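To make "low enough noise" concrete, consider a binary judge with false-positive rate `fp` and false-negative rate `fn`. If a policy's true success rate is `p`, the expected judged reward is `p * (1 - fn) + (1 - p) * fp`, so the gap between two policies shrinks by the factor `1 - fp - fn`. The sketch below only evaluates that arithmetic; the error rates are made-up illustrative values, not measurements from the paper.

```python
def expected_judged_reward(p, fp, fn):
    """Expected reward from a noisy binary judge for true success rate p."""
    return p * (1.0 - fn) + (1.0 - p) * fp


def signal_retained(fp, fn):
    """Fraction of a true success-rate gap that survives judge noise."""
    return 1.0 - fp - fn


# Illustrative (hypothetical) judge error rates.
fp, fn = 0.10, 0.15
gap_true = 0.60 - 0.50
gap_judged = (
    expected_judged_reward(0.60, fp, fn)
    - expected_judged_reward(0.50, fp, fn)
)
```

With these placeholder rates, three quarters of the true gap between the two policies remains visible in the judged rewards.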
What would settle it
If the trained 4B agent shows success rates below 50% when evaluated on a fresh set of live websites whose interfaces were not encountered during the 2.2K RL tasks, the claim that the online RL pipeline produces effective and stable policies would be falsified.
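Because any held-out evaluation uses a finite number of tasks, the 50% threshold is more safely checked against an interval than against a point estimate. The sketch below computes a Wilson score interval for a success rate. The choice of the Wilson interval, the 95% level, and the example counts are assumptions made for illustration; none of them comes from the paper or from the test stated above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
        / denom
    )
    return center - half, center + half


def clearly_below(successes, trials, threshold=0.5):
    """True only if the whole interval lies below the threshold."""
    _, upper = wilson_interval(successes, trials)
    return upper < threshold


# Hypothetical held-out run: 120 successes out of 300 fresh tasks.
low, high = wilson_interval(120, 300)
```

A run whose upper bound still falls below 0.5 would count against the claim; a run whose interval straddles 0.5 would leave the question open.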
Original abstract
Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenWebRL, an open framework for training visual web agents via online multi-turn RL directly on live websites. It covers the full pipeline: scalable live-browser infrastructure, supervised initialization from 0.4K trajectories, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. The central empirical claim is that OpenWebRL-4B, after 2.2K open-ended RL tasks, reaches 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale while remaining competitive with proprietary systems such as OpenAI CUA and Gemini CUA. The work also examines key design choices and how RL improves agentic reasoning.
Significance. If the reported results hold under rigorous validation, the contribution would be significant: it demonstrates that online multi-turn RL can produce competitive visual web agents with far smaller data volumes than supervised post-training on curated trajectories, while releasing infrastructure, data, models, and code to enable reproducible open research. This directly addresses the scalability bottleneck highlighted in the abstract and provides concrete evidence on effective design choices for RL in dynamic web environments.
Major comments (2)
- §4 (Experiments) and §3.4 (Success Judging): The headline performance numbers rest on the assumption that trajectory-level success judging supplies sufficiently low-noise rewards for stable multi-turn policy optimization. The manuscript must supply quantitative validation of the judge (e.g., agreement rate with human labels on a held-out set of trajectories, false-positive/false-negative rates on dynamic pages) to substantiate that credit assignment across long horizons is reliable; without this, the gains achieved with only 2.2K RL tasks remain difficult to attribute to effective RL rather than judge artifacts.
- Table 1 / Results: The reported success rates (67.0% and 64.0%) are presented without error bars, variance across seeds, or ablations isolating the contribution of the success judge versus other pipeline components. This omission is load-bearing because the central claim is that online RL succeeds with small data volumes; the absence of these controls leaves open the possibility that results are sensitive to judging noise or website state variability.
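The validation requested in the first major comment amounts to comparing judge verdicts against human labels on the same trajectories. A minimal sketch of that comparison follows, assuming binary labels; the function name `judge_validation` and the toy data are hypothetical.

```python
def judge_validation(judge, human):
    """Compare binary judge verdicts with human labels on the same trajectories.

    Returns the agreement rate, the false-positive rate (judge says success,
    human says failure) and the false-negative rate (judge says failure,
    human says success). A rate is None when its denominator is zero.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(1 for j, h in zip(judge, human) if j and h)
    tn = sum(1 for j, h in zip(judge, human) if not j and not h)
    fp = sum(1 for j, h in zip(judge, human) if j and not h)
    fn = sum(1 for j, h in zip(judge, human) if not j and h)
    return {
        "agreement": (tp + tn) / len(judge),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Toy example: 8 trajectories with hypothetical labels.
judge = [1, 1, 0, 1, 0, 0, 1, 0]
human = [1, 0, 0, 1, 0, 1, 1, 0]
report = judge_validation(judge, human)
```

Calling the same function on subsets of trajectories (for example, pages grouped by how dynamic they are) would give the stratified error rates the comment asks about.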
Minor comments (2)
- Abstract: The phrase 'we systematically study the key design choices' is stated without enumerating them; a brief parenthetical list or forward reference to the relevant section would improve clarity.
- §5 (Analysis): The discussion of how RL improves agentic reasoning would benefit from explicit comparison of pre- and post-RL trajectories on the same tasks to illustrate the claimed improvements.
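One simple way to present the pre- versus post-RL comparison suggested for §5 is a paired outcome table over the same tasks. The sketch below counts the four outcome combinations; the helper name and the task outcomes are invented for illustration.

```python
from collections import Counter


def paired_outcomes(before, after):
    """Count per-task outcome transitions between two agents on the same tasks."""
    if len(before) != len(after):
        raise ValueError("outcome lists must cover the same tasks")
    labels = {
        (False, True): "fixed_by_rl",
        (True, False): "broken_by_rl",
        (True, True): "both_succeed",
        (False, False): "both_fail",
    }
    return Counter(labels[(bool(b), bool(a))] for b, a in zip(before, after))


# Hypothetical outcomes on six shared tasks (1 = success, 0 = failure).
before = [0, 0, 1, 1, 0, 1]
after = [1, 0, 1, 0, 1, 1]
counts = paired_outcomes(before, after)
```

Tasks in the `fixed_by_rl` and `broken_by_rl` cells are the natural candidates for side-by-side trajectory inspection.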
Simulated Author's Rebuttal
We thank the referee for the constructive comments emphasizing the need for rigorous validation of the success judge and improved statistical reporting. We address each major comment below and will revise the manuscript accordingly to strengthen these aspects.
Point-by-point responses
- Referee: §4 (Experiments) and §3.4 (Success Judging): The headline performance numbers rest on the assumption that trajectory-level success judging supplies sufficiently low-noise rewards for stable multi-turn policy optimization. The manuscript must supply quantitative validation of the judge (e.g., agreement rate with human labels on a held-out set of trajectories, false-positive/false-negative rates on dynamic pages) to substantiate that credit assignment across long horizons is reliable; without this, the gains achieved with only 2.2K RL tasks remain difficult to attribute to effective RL rather than judge artifacts.
Authors: We agree that quantitative validation of the judge is necessary to confidently attribute performance gains to the RL process. The manuscript describes the trajectory-level success judge in §3.4 but does not include human agreement metrics or error rate breakdowns. In the revised manuscript we will add a dedicated analysis in §3.4 reporting agreement rates with human labels on a held-out trajectory set together with false-positive and false-negative rates stratified by page dynamism. This addition will directly address concerns about reward noise and credit assignment reliability.
Revision: yes
- Referee: Table 1 / Results: The reported success rates (67.0% and 64.0%) are presented without error bars, variance across seeds, or ablations isolating the contribution of the success judge versus other pipeline components. This omission is load-bearing because the central claim is that online RL succeeds with small data volumes; the absence of these controls leaves open the possibility that results are sensitive to judging noise or website state variability.
Authors: We acknowledge that the absence of error bars, seed variance, and judge-specific ablations weakens the robustness claims. The current manuscript reports point estimates only. In the revision we will add error bars for the main results (computed from available repeated evaluations), report observed variance, and include an ablation that isolates the judge by comparing against alternative reward formulations. Due to the high cost of live-web RL runs, full multi-seed experiments are resource-intensive, so we will provide all feasible statistical controls and ablations rather than exhaustive ones.
Revision: partial
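The error bars discussed in the second response can be as simple as a mean and standard error over repeated evaluation runs. The sketch below assumes each run yields one overall success rate; the run values are placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def mean_and_stderr(run_success_rates):
    """Mean and standard error of the mean over repeated evaluation runs."""
    if len(run_success_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(run_success_rates)
    se = stdev(run_success_rates) / math.sqrt(len(run_success_rates))
    return m, se


# Placeholder success rates from three hypothetical repeated evaluations.
runs = [0.66, 0.68, 0.67]
m, se = mean_and_stderr(runs)
```

Repeated evaluations of one trained checkpoint capture website and sampling variability only; variance across training seeds would need separate training runs, which is the costlier control the authors describe.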
Circularity Check
No circularity: empirical training results with no derived quantities or self-referential fits
Full rationale
The paper reports direct empirical success rates (67.0% on Online-Mind2Web, 64.0% on DeepShop) obtained by running online multi-turn RL on live websites using 0.4K initialization trajectories and 2.2K RL tasks. No equations, parameter fits, or first-principles derivations are described; the central claims are measured benchmark outcomes rather than quantities obtained by construction from prior self-citations or normalizations. The trajectory-level success judge is presented as an engineering component of the framework, not as a fitted or self-defined predictor. This is a standard empirical RL paper whose performance numbers stand or fall on external replication, not on internal definitional equivalence.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Bir´ e, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, et al. Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025
-
[3]
Fara-7b: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025
Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, and An- drew Zhao. Fara-7B: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025. 13 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for...
-
[4]
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. WebGym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, 2024
2024
-
[6]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Web agents with world models: Learning and leveraging environment dynamics in web navigation
Hyungjoo Chae, Namyoung Kim, Kai Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. InInternational Conference on Learning Representations, volume 2025, pages 63707–63738, 2025
2025
-
[8]
Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, et al. Era: Transforming vlms into embodied agents via embodied prior learning and online reinforcement learning.arXiv preprint arXiv:2510.12693, 2025
-
[9]
Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang, Gang Wang, and Huan Zhang. Captcha solving for native gui agents: Automated reasoning-action data generation and self-corrective training.arXiv preprint arXiv:2603.23559, 2026
work page internal anchor Pith review arXiv 2026
-
[10]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024
2024
-
[11]
Seeclick: Harnessing gui grounding for advanced visual gui agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024
2024
-
[12]
Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023
2023
-
[13]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023
2023
-
[14]
Navigating the digital world as humans do: Universal visual grounding for gui agents
Boyu Gou, Demi Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. InInternational Conference on Learning Representations, volume 2025, pages 30851–30883, 2025
2025
-
[15]
Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024
-
[16]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
2025
-
[17]
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
Webvoyager: Building an end-to-end web agent with large multimodal models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In 14 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics...
2024
-
[19]
Openwebvoyager: Building multimodal web agents via iterative real-world ex- ploration, feedback and optimization
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, and Dong Yu. Openwebvoyager: Building multimodal web agents via iterative real-world ex- ploration, feedback and optimization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27545–27564, 2025
2025
-
[20]
Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025
-
[21]
Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025
2025
-
[22]
Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.Advances in Neural Information Processing Systems, 38, 2026
Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.Advances in Neural Information Processing Systems, 38, 2026
2026
-
[23]
Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half: A survey.arXiv preprint arXiv:2602.06052, 2026
-
[24]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025
-
[26]
Visual-rft: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025
2034
-
[27]
Xing Han L` u, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Sta´ nczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agentrewardbench: Evaluating automatic evaluations of web agent trajectories.arXiv preprint arXiv:2504.08942, 2025
-
[28]
Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning
Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026
2026
-
[29]
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025
Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025
-
[31]
Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling.Advances in Neural Information Processing Systems, 37:134387–134429, 2024
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling.Advances in Neural Information Processing Systems, 37:134387–134429, 2024
2024
-
[32]
WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373, 2024. 15 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Orchard: An Open-Source Agentic Modeling Framework
Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandrio Sordoni, Xingdi Yuan, Yelong Shen, et al. Orchard: An open-source agentic modeling framework. arXiv preprint arXiv:2605.15040, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. InInternational Conference on Learning Representations, volume 2025, pages 79791–79821, 2025
2025
-
[35]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025
Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, and Ruslan Salakhutdinov. Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025
-
[41]
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.Advances in Neural Information Processing Systems, 38:30865–30891, 2026
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.Advances in Neural Information Processing Systems, 38:30865–30891, 2026
2026
-
[43]
Vagen: Reinforcing world model reasoning for multi-turn vlm agents.Advances in Neural Information Processing Systems, 38:172871–172933, 2026
Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Yiping Lu, Zhengyuan Yang, Lijuan Wang, et al. Vagen: Reinforcing world model reasoning for multi-turn vlm agents.Advances in Neural Information Processing Systems, 38:172871–172933, 2026
2026
-
[44]
WebXSkill: Skill Learning for Autonomous Web Agents
Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, et al. Webxskill: Skill learning for autonomous web agents.arXiv preprint arXiv:2604.13318, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[45]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning
Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7920–7939, 2025
2025
-
[47]
Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143, 2025
Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143, 2025. 16 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
-
[48]
Os-atlas: Foundation action model for generalist gui agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: Foundation action model for generalist gui agents. InInternational Conference on Learning Representations, volume 2025, pages 5090–5108, 2025
2025
-
[49]
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction.arXiv preprint arXiv:2412.04454, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
An illusion of progress? assessing the current state of web agents
Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. InSecond Conference on Language Modeling, 2025
2025
-
[51]
Magma: A foundation model for multimodal ai agents
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. InProceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025
2025
-
[52]
Agentoccam: A simple yet strong baseline for llm-based web agents
Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, and Huzefa Rangwala. Agentoccam: A simple yet strong baseline for llm-based web agents. In International Conference on Learning Representations, volume 2025, pages 97533–97565, 2025
2025
-
[53]
Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents.arXiv preprint arXiv:2502.09560, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Regularizing hidden states enables learning generalizable reward model for llms.Advances in Neural Information Processing Systems, 37:62279–62309, 2024
Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for llms.Advances in Neural Information Processing Systems, 37:62279–62309, 2024
2024
-
[55]
GUI-Libra: Training native GUI agents to reason and act with action-aware supervision and partially verifiable RL, 2026
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, and Tong Zhang. GUI-Libra: Training native GUI agents to reason and act with action-aware supervision and partially verifiable RL, 2026
2026
-
[56]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023
2023
-
[57]
Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, and Huan Zhang. How do visual attributes influence web agents? a comprehensive evaluation of user interface design factors.arXiv preprint arXiv:2601.21961, 2026
-
[58]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024
2024
-
[60]
Beat: Visual backdoor attacks on vlm-based embodied agents via contrastive trigger learning, 2026
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, and Daniel Kang. Beat: Visual backdoor attacks on vlm-based embodied agents via contrastive trigger learning, 2026
2026
-
[61]
AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025
Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, Rui Lu, Hongning Wang, Jie Tang, and Yuxiao Dong. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025
2025
-
[62]
LlamaFactory: Unified efficient fine-tuning of 100+ language models
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ...
2024
-
[63]
DeepResearcher: Scaling deep research via reinforcement learning in real-world environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, 2025
2025
-
[64]
Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents
Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents. In International Conference on Machine Learning, 2025
2025
-
[65]
WorkForceAgent-R1: Incentivizing reasoning capability in LLM-based web agents via reinforcement learning
Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, and Chao Zhang. WorkForceAgent-R1: Incentivizing reasoning capability in LLM-based web agents via reinforcement learning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 34–49, 2026.
2026
-
[66]
executed. Note: no visible navigation or new tab detected. write Focused element and typed content Succeed: ‘write‘ typed "Alpine Ridge" into <input> role=combobox "Search". write Actual value mismatch Note: the field’s actual value is "New York, NY", which differs from the typed text. scroll Scroll direction, amount, and boundary detection Succeed: ‘scroll‘ ...
-
[67]
name": ...,
Observation, including tab info, screenshot, and optional A11y Tree ## Output Requirements Your output must include one <think> block and one or more <tool call> blocks. The response must follow this exact structure: <think>...</think> <tool call>{"name": ..., "arguments...
-
[68]
Web Task Instruction: This is a clear and specific directive provided in natural language, detailing the online activity to be carried out
-
[69]
Use it to understand what the agent tried to do, but do not treat it as ground truth if it conflicts with the screenshots
Agent Action History: This is a chronological summary of the agent’s observed actions across steps. Use it to understand what the agent tried to do, but do not treat it as ground truth if it conflicts with the screenshots
-
[70]
point 2d
Result Screenshots: This is a visual representation of the screen showing the result or intermediate state of performing a web task. Each screenshot will be annotated with an inferred step index in text. 4.Result Response: This is a textual response obtained after the execution of the web task. ### TASK:{} ### Agent Action History:{} ### Result Response:{...
-
[71]
SHOP MEN'S
Add to cart I see there's a "SHOP MEN'S" link which would be a good starting point. ... I should click on the "SHOP MEN'S" link to navigate to the men's department page. Action: {"name": "click" , "arguments": {"point_2d": [308,530]}} 1 Reasoning: The user wants to add the most top-selling Birkenstock's men clogs in brown, size 10-10.5 to the cart. ... I'...
-
[72]
name": "hover
Look for a way to sort by customer rating … I should hover over the product images or names to see if detailed star ratings become visible … Action: {"name": "hover" , "arguments": {"point_2d": [344,584]}} 9 Reasoning:The hover action over the KIVIK Sofa … didn't surface any customer star ratings as previously hoped. ... The MANNARP at $429 might be a can...