arxiv: 2603.05044 · v2 · submitted 2026-03-05 · 💻 cs.AI

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

Sicheng Fan , Qingyun Shi , Shengze Xu , Shengbo Cai , Tieyong Zeng , Li Ling , Yanyi Shang , Dehan Kong

Pith reviewed 2026-05-15 16:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords GUI agents · web agents · reinforcement learning · synthetic data · LLM grounding · automated training · embodied agents · foundation models

The pith

WebFactory's automated pipeline compresses an LLM's web knowledge into agents that match human-trained performance using synthetic data from only 10 sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current training for GUI agents depends on either unsafe live web sessions or expensive human-labeled data across many environments. The paper shifts focus to how efficiently a large language model's existing knowledge can be turned into reliable actions rather than simply collecting more data. WebFactory runs a closed loop of generating synthetic websites, creating tasks, collecting trajectories with an LLM, and training via decomposed-reward reinforcement learning. An agent built this way on data from just 10 sites reaches the same level of success as agents trained on equal amounts of human data drawn from far larger collections of sites. The result holds on both internal offline tests and online transfer benchmarks, and the trained agent also beats its own base LLM.

Core claim

WebFactory shows that a fully automated pipeline of environment synthesis, knowledge-aware task generation, LLM-driven trajectory collection, and decomposed-reward RL training can convert the latent internet knowledge inside a foundation model into grounded, executable web-agent behavior. When the resulting agent is trained solely on synthetic data from ten websites, it achieves performance comparable to agents trained on the same volume of human-annotated data collected across many more environments, and it outperforms the base foundation model on both offline and online transfer benchmarks.

What carries the argument

The WebFactory closed-loop reinforcement learning pipeline that automates scalable environment synthesis from real websites, task generation, LLM-powered trajectory collection, and decomposed-reward training to convert passive language-model knowledge into active interaction policies.
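
The loop described above can be pictured with a minimal Python sketch. Everything here is a hypothetical stand-in: the names `Task`, `Trajectory`, and `run_round`, the callables passed in, and the choice to keep only successful trajectories are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Task:
    site: str          # which synthetic website the task lives on
    instruction: str   # natural-language goal for the agent


@dataclass
class Trajectory:
    task: Task
    actions: List[str] = field(default_factory=list)
    success: bool = False


def run_round(
    sites: Iterable[str],
    make_tasks: Callable[[str], List[Task]],
    collect: Callable[[Task], Trajectory],
    train: Callable[[List[Trajectory]], None],
) -> Tuple[int, int]:
    """Run one pass: generate tasks, collect trajectories, train on the kept ones."""
    tasks = [task for site in sites for task in make_tasks(site)]
    trajectories = [collect(task) for task in tasks]
    # Assumption: only successful trajectories are used for training.
    kept = [t for t in trajectories if t.success]
    train(kept)
    return len(tasks), len(kept)
```

Passing the stages in as callables keeps the sketch agnostic about how environments are synthesized or how the policy is updated, which the summary above does not specify.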

If this is right

Agent training becomes scalable without human annotation or exposure to live unsafe web traffic.
Performance on offline and online transfer benchmarks remains competitive with agents trained on much larger human datasets.
The trained agent reliably exceeds the base foundation model on web-interaction tasks.
Different foundation models exhibit measurable differences in how readily their knowledge can be embodied as agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis-and-compression loop could be applied to other interface domains such as mobile apps or desktop software once suitable environment generators exist.
If the approximation between synthetic and real sites holds, data volume may matter less than the quality of the knowledge-extraction step for building interactive agents.
The pipeline supplies a concrete way to compare foundation models on an embodiment axis rather than on text benchmarks alone.
Extending the reward decomposition or trajectory collection to handle more dynamic page elements could further reduce any remaining distribution shift.

Load-bearing premise

Synthetic websites and LLM-generated trajectories approximate real web pages and user interactions closely enough that the trained agent will not suffer large performance drops when moved to actual live sites.

What would settle it

Deploy the trained agent on a fresh set of real-world websites never seen during synthesis and measure whether its task-completion rate drops substantially below the rate achieved by the human-data baseline on identical tasks.
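
One way to run the check described above is to compare completion rates with an interval around each. The sketch below uses the Wilson score interval; the function names and any counts fed into them are hypothetical, and nothing here comes from the paper's own evaluation code.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def completion_gap(agent_successes: int, baseline_successes: int, trials: int) -> float:
    """Baseline rate minus agent rate on the same task set; positive means the agent is behind."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return (baseline_successes - agent_successes) / trials
```

If the two intervals overlap heavily and the gap is small, the "substantial drop" in the test above would not be supported; a gap well outside both intervals would count against the premise.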

Figures

Figures reproduced from arXiv: 2603.05044 by Dehan Kong, Li Ling, Qingyun Shi, Shengbo Cai, Shengze Xu, Sicheng Fan, Tieyong Zeng, Yanyi Shang.

**Figure 1.** Overview of the WebFactory, which compresses foundation-model intelligence into grounded GUI agents through three stages: high-fidelity offline environment & task synthesis, scalable trajectory generation, and unified-action RL training.

2 METHOD, 2.1 A HIGH-FIDELITY, FULLY CONTROLLABLE WEB ENVIRONMENT. To enable scalable data generation and automated RL training for web agents, we develop a fully controlla…

**Figure 2.** Representative offline websites from our curated environment (6 of 10 shown).

**Figure 3.** Performance comparison of agents trained with data generated by different foundation …

**Figure 4.** Ground truth action distribution in the dataset.

**Figure 5.** Action transition heatmap showing transition counts between actions.
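
For readers unfamiliar with the statistic behind Figure 5, transition counts between actions can be tallied as in the sketch below. The function name and the representation of a trajectory as a list of action-type strings are assumptions made for illustration; the paper's own data format is not shown on this page.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def transition_counts(trajectories: Iterable[List[str]]) -> Counter:
    """Count how often each action type is immediately followed by another."""
    counts: Counter = Counter()
    for actions in trajectories:
        # Pair every action with its successor inside the same trajectory.
        for prev, nxt in zip(actions, actions[1:]):
            counts[(prev, nxt)] += 1
    return counts
```

Each key is a `(previous, next)` pair, so the counter maps directly onto the cells of a heatmap.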

Original abstract

Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WebFactory packages an automated closed-loop pipeline for synthetic web environments and decomposed-reward RL, but the abstract supplies no numbers to support the data-efficiency claim.

The paper's main contribution is a fully automated pipeline that generates synthetic web environments from only 10 sites, creates knowledge-aware tasks, collects LLM trajectories, and trains agents with decomposed-reward RL. It claims the resulting agent matches the performance of agents trained on human-annotated data from a much larger set of environments, while also beating the base LLM on internal benchmarks. That closed-loop structure is what is new relative to earlier GUI-agent papers that relied on manual environments or live web scraping.

Referee Report

3 major / 1 minor

Summary. The paper introduces WebFactory, a fully automated closed-loop RL pipeline for training GUI agents. It performs scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, and decomposed-reward training, all starting from synthetic data generated on only 10 websites. The central empirical claim is that the resulting agent matches the performance of GUI agents trained on equivalent volumes of human-annotated data drawn from a much larger set of environments, while also significantly outperforming the base foundation model on the authors' internal offline and online transfer benchmarks. The work additionally reports insights into the 'embodiment potential' of different LLMs.

Significance. If the quantitative results and generalization claims are substantiated with rigorous evidence, the paper would demonstrate a scalable route to data-efficient compression of LLM knowledge into grounded web agents, substantially reducing dependence on costly human annotation and unsafe live-web interaction. It would also supply a new axis for evaluating foundation models and a reproducible template for closed-loop synthetic-data pipelines in interactive settings.

major comments (3)

[Abstract] Abstract: the claims that the agent 'achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments' and 'significantly outperforms the base foundation model' are stated without any numerical metrics, baselines, error bars, or statistical tests. Because these assertions constitute the headline result, their absence renders the central claim unsupported by visible evidence.
[Pipeline Description] Pipeline and Evaluation sections: no quantitative diagnostics (KL divergence on page-state distributions, action-coverage statistics, or failure-mode overlap) are supplied to measure how closely the synthetic environments and LLM-collected trajectories approximate real-web interaction distributions. Without such measures, the data-efficiency claim rests on an unverified assumption that the 10-site synthetic distribution is sufficiently representative.
[Evaluation] Evaluation: the internal offline and online transfer benchmarks are invoked to support both comparability and outperformance, yet the manuscript provides no description of how these benchmarks were constructed, which public or external datasets (if any) were used, or how environments and success metrics were chosen to avoid favoring the proposed synthesis pipeline.
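
The diagnostics requested in the second comment can be made concrete with a small sketch. It computes a smoothed distribution over action types, the KL divergence between two such distributions, and action coverage. Note the simplification: the referee asks for KL divergence on page-state distributions, whereas this sketch uses action types as a stand-in, and the function names, the smoothing constant, and the action vocabulary are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def action_distribution(
    actions: Sequence[str], vocabulary: List[str], smoothing: float = 1.0
) -> Dict[str, float]:
    """Additively smoothed frequency of each action type in the vocabulary."""
    counts = Counter(actions)
    total = len(actions) + smoothing * len(vocabulary)
    return {a: (counts[a] + smoothing) / total for a in vocabulary}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; assumes q is positive wherever p is positive."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


def action_coverage(actions: Sequence[str], vocabulary: List[str]) -> float:
    """Fraction of the vocabulary that appears at least once."""
    return len(set(actions) & set(vocabulary)) / len(vocabulary)
```

Smoothing keeps the divergence finite when one source never uses an action that the other does, which is likely when comparing a 10-site synthetic set against a broader corpus.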

minor comments (1)

[Abstract] Abstract: the phrase 'embodiment potential' is introduced without a concise definition or citation to prior usage, which may hinder readers' immediate understanding of the new evaluation axis being proposed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on strengthening the empirical presentation and will revise the manuscript accordingly to address the concerns raised.

Referee: [Abstract] Abstract: the claims that the agent 'achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments' and 'significantly outperforms the base foundation model' are stated without any numerical metrics, baselines, error bars, or statistical tests. Because these assertions constitute the headline result, their absence renders the central claim unsupported by visible evidence.

Authors: We agree that the abstract should include key quantitative results to make the central claims immediately verifiable. In the revised manuscript we will update the abstract to report specific success rates on the transfer benchmarks (with error bars), direct numerical comparisons to the human-annotated baseline and the base foundation model, and a brief reference to the statistical tests performed.

revision: yes

Referee: [Pipeline Description] Pipeline and Evaluation sections: no quantitative diagnostics (KL divergence on page-state distributions, action-coverage statistics, or failure-mode overlap) are supplied to measure how closely the synthetic environments and LLM-collected trajectories approximate real-web interaction distributions. Without such measures, the data-efficiency claim rests on an unverified assumption that the 10-site synthetic distribution is sufficiently representative.

Authors: We acknowledge the utility of explicit distributional diagnostics. We will add action-coverage statistics and failure-mode overlap analysis to the revised Pipeline and Evaluation sections. KL divergence on full page-state distributions is not feasible given the absence of a large, safe real-web corpus; we will instead rely on and report the downstream performance metrics as the primary validation of representativeness while noting this limitation.

revision: partial

Referee: [Evaluation] Evaluation: the internal offline and online transfer benchmarks are invoked to support both comparability and outperformance, yet the manuscript provides no description of how these benchmarks were constructed, which public or external datasets (if any) were used, or how environments and success metrics were chosen to avoid favoring the proposed synthesis pipeline.

Authors: We agree that a transparent description of benchmark construction is required. The revised Evaluation section will include a detailed account of how the internal offline and online transfer benchmarks were assembled, the criteria used to select environments and tasks, the definition of success metrics, and the steps taken to mitigate bias toward the synthetic pipeline. Although the benchmarks are internal and not drawn from public datasets, we will supply sufficient methodological detail for readers to assess fairness.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical closed-loop pipeline (environment synthesis, LLM trajectory collection, decomposed-reward RL) whose headline claim of data-efficient generalization rests on internal offline/online benchmarks. No equations, definitions, or self-citations are shown that reduce the reported performance comparability to a tautology or fitted input by construction. The approximation quality of synthetic sites to live web distributions is an explicit modeling assumption, not a hidden self-referential step. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the unverified premise that synthetic data and LLM-generated trajectories transfer to real web performance; no explicit free parameters are named in the abstract but reward decomposition weights and environment synthesis rules are implicit and unstated.

axioms (1)

Domain assumption: Synthetic web environments generated automatically can stand in for real websites during training and evaluation.
Invoked when claiming comparable performance from 10 synthetic sites versus human data from many more environments.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Reyna Abhyankar, Qi Qi, and Yiying Zhang. Osworld-human: Benchmarking the efficiency of computer-use agents. arXiv preprint arXiv:2506.16042.

[2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[3]

Qwen2.5-VL Technical Report. URL https://arxiv.org/abs/2502.13923.

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264.

[4]

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317.

[5]

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. Babyai: A platform to study the sample efficiency of grounded language learning. arXiv preprint arXiv:1810.08272.

[6]

De Chezelles, Thibault Le Sellier, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The browsergym ecosystem for web agent research. arXiv preprint arXiv:2412.05467.

[7]

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.

[8]

Apurva Gandhi and Graham Neubig. Go-browse: Training web agents with structured exploration. arXiv preprint arXiv:2506.03533.

[9]

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, et al. Real: Benchmarking autonomous agents on deterministic simulations of real websites. arXiv preprint arXiv:2504.11543.

[10]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919.

[11]

Published as a conference paper at ICLR 2026

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649 [cs.CL]. https://arxiv.org/abs/2401.13649

[12]

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.

[13]

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.

[14]

Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, and Xiang Yue. Harnessing webpage uis for text-rich visual understanding. arXiv preprint arXiv:2410.13824.

[15]

Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. arXiv preprint arXiv:2406.08451, 2024.

[16]

Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458.

[17]

Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, et al. Webchorearena: Evaluating web browsing agents on realistic tedious web tasks. arXiv preprint arXiv:2506.01952.

[18]

Shikhar Murty, Hao Zhu, Dzmitry Bahdanau, and Christopher D Manning. Nnetnav: Unsupervised learning of browser agents through environment interaction in the wild. arXiv preprint arXiv:2410.02907.

[19]

Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6300–6323.

[21]

URL https://arxiv.org/abs/2406.12373.

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337.

[22]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652.

[23]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

[24]

Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, and Ruslan Salakhutdinov. Towards internet-scale training for agents. arXiv preprint arXiv:2502.06776.

[25]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

[26]

Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. arXiv preprint arXiv:2505.16421.

[27]

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605.

[28]

Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, and Huzefa Rangwala. Agentoccam: A simple yet strong baseline for llm-based web agents. arXiv preprint arXiv:2410.13825.

[29]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023a.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing...

[30]

site recipe

A DISCLOSURE OF LLM USE. We used large language models to assist with language polishing and discovering related work. All technical claims, experiments, and analyses were designed, executed, and verified by the authors.

B RELATED WORK. Web environments and benchmarks. Research on web-agent environments has gradu...

[31]

User-First, Model-Agnostic

G.2 REWARD. Per-step reward:

$$R_t = \alpha R_f + \beta R_{\text{accuracy}}, \tag{4}$$

where $R_f$ enforces structured outputs (valid JSON, tags, type constraints) and $R_{\text{accuracy}}$ is action-specific:

Table 7: Detailed specification of the web agent action space, list...
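
The per-step reward in the G.2 fragment above, a weighted sum of a format term and an action-specific accuracy term, can be illustrated with a short Python sketch. Only the weighted-sum form and the idea that the format term checks for valid JSON come from the fragment. The function names, the required `"action"` key, the exact-match accuracy rule, and the default weights `alpha=0.2` and `beta=0.8` are assumptions chosen for illustration, since the excerpt is cut off before it defines the accuracy term.

```python
import json
from typing import Any, Dict, Tuple


def format_reward(raw_output: str, required_keys: Tuple[str, ...] = ("action",)) -> float:
    """Return 1.0 if the output is a JSON object containing the required keys, else 0.0."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


def accuracy_reward(predicted: Dict[str, Any], target: Dict[str, Any]) -> float:
    """Hypothetical accuracy term: exact match on the action type only."""
    return 1.0 if predicted.get("action") == target.get("action") else 0.0


def step_reward(raw_output: str, target: Dict[str, Any],
                alpha: float = 0.2, beta: float = 0.8) -> float:
    """Weighted sum of the format and accuracy terms for a single step."""
    r_f = format_reward(raw_output)
    # Accuracy is only scored when the output parsed as a well-formed action.
    r_acc = accuracy_reward(json.loads(raw_output), target) if r_f else 0.0
    return alpha * r_f + beta * r_acc
```

Keeping the two terms separate means a well-formed but wrong action still earns the format share, which is one plausible reading of why the reward is described as decomposed.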