arxiv: 2605.07134 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Region4Web: Rethinking Observation Space Granularity for Web Agents

Donguk Kwon, Dongha Lee

Pith reviewed 2026-05-11 02:07 UTC · model grok-4.3

keywords: web agents, observation space, functional regions, AXTree, PageDigest, Region4Web, WebArena, task success

The pith

Web agents achieve higher task success with shorter observations by using functional regions instead of element-level details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that web agents have relied on observation spaces at the same fine element-level granularity as their actions, which leaves the page's overall functional organization implicit and requires the agent to reconstruct it repeatedly from raw signals. The authors instead advocate for observations at the level of functional regions, defined as page parts that each serve a distinct purpose. They introduce Region4Web to reorganize the AXTree via hierarchical decomposition and semantic abstraction, then deliver the result through PageDigest as a compact, persistent per-page digest. Experiments on WebArena show this yields shorter observations and higher success rates across multiple LLMs and agent methods, independent of model size. A reader would care because the result points to a basic redesign of how agents perceive web pages that improves performance without requiring larger backbones.

Core claim

Reorganizing the AXTree into functional regions through hierarchical decomposition and semantic abstraction exposes the page's functional organization as a more compact and informative basis for the actor agent than element-level signals, and PageDigest supplies this region-level view as a persistent per-page digest that improves task success rates on WebArena while reducing observation length across diverse LLMs and established agent methods.

What carries the argument

The Region4Web framework, which hierarchically decomposes and semantically abstracts the AXTree into functional regions, together with PageDigest, the inference pipeline that delivers the result to the agent as a compact per-page digest persisting across agent steps.
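To make the contrast between element-level and region-level observation concrete, here is a minimal sketch. It is not the authors' implementation: the paper's decomposition and semantic abstraction steps are not reproduced here. The `AXNode` class, the `region` field (assumed to be a label already attached to the root of each functional region), and the helpers `collect_regions` and `render_digest` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AXNode:
    """Hypothetical, simplified accessibility-tree node (not a real browser API)."""
    role: str
    name: str = ""
    children: List["AXNode"] = field(default_factory=list)
    # Assumption: a label that marks this node as the root of a functional region.
    region: Optional[str] = None


def collect_regions(node, current=None, regions=None) -> Dict[str, List[str]]:
    """Group named elements under the nearest ancestor marked as a region root."""
    if regions is None:
        regions = {}
    if node.region is not None:
        # Entering a new region: descendants are attributed to it.
        current = node.region
        regions.setdefault(current, [])
    elif current is not None and node.name:
        regions[current].append(f"{node.role}: {node.name}")
    for child in node.children:
        collect_regions(child, current, regions)
    return regions


def render_digest(regions) -> str:
    """Render one line per region instead of one line per element."""
    return "\n".join(f"[{label}] {len(items)} elements" for label, items in regions.items())


# Toy page, invented for this example.
page = AXNode("RootWebArea", "Shop", [
    AXNode("navigation", region="site navigation", children=[
        AXNode("link", "Home"), AXNode("link", "Orders")]),
    AXNode("main", children=[
        AXNode("form", region="product search", children=[
            AXNode("textbox", "Search"), AXNode("button", "Go")])]),
])
print(render_digest(collect_regions(page)))
```

In this toy version the digest has one line per region, so its length grows with the number of regions and not with the number of elements. That is only the compactness half of the paper's claim; whether such a digest keeps enough detail for the agent is the empirical question the review raises.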

Load-bearing premise

The hierarchical decomposition and semantic abstraction of the AXTree into functional regions accurately captures each page's functional organization and supplies a strictly more useful signal to the agent than element-level observations.

What would settle it

An experiment on WebArena tasks that keeps the same agents and LLMs but swaps PageDigest for standard element-level AXTree observations and measures whether success rates stay the same or fall.

Figures

Figures reproduced from arXiv: 2605.07134 by Dongha Lee, Donguk Kwon.

**Figure 3.** Overview of Region4Web inference process.

**Figure 4.** Overview of PageDigest. Further details on AXTree preprocessing, dataset construction, and Region4Web implementation are provided in Appendices D, E, and F, respectively. (Section 4, PageDigest:) Region4Web produces region-level observation for a given page, but deploying it in web environments requires focusing the observation on what is task-relevant and tracking how pages change as the agent acts. We propose PageDi…

**Figure 6.** Failure mode distribution under PageDigest on WebArena. … triggering step and label every PageDigest stage, as shown in …

**Figure 7.** Edge-level F1 on training and validation sets over 140 epochs. Epoch 125 is selected for deployment.

**Figure 8.** Training and validation loss over 90 epochs. Step 65,350 is selected for deployment.

**Figure 9.** Prompt for decomposition annotation stage.

**Figure 10.** Prompt for partition verification stage.

**Figure 11.** Prompt for abstraction, used across annotation, training, and inference.

**Figure 12.** Prompt for task-relevant region selection.

**Figure 13.** Prompt template for action selection. {action_space} is replaced with the set of 15 …

**Figure 14.** Prompt for action selection with PageDigest. Only the additions to Figure 13 are shown.

Original abstract

Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Region4Web's shift to region-level observations for web agents looks promising on WebArena, but the abstract leaves the source of the gains unclear.

The main thing to know is that this paper separates observation granularity from action granularity in web agents and shows that region-level views can be more compact and effective. They reorganize the AXTree into functional regions using hierarchical decomposition and semantic abstraction, then deliver it via PageDigest as a compact per-page summary that persists across steps. This is new in the web agent literature as far as the abstract goes. It does a good job highlighting how element-level observations leave structure implicit and force the agent to figure out the page organization each time. The empirical part reports shorter observations and higher success rates across LLMs and methods, which is encouraging if it holds. The soft spots are around the reliability of that abstraction. Without details on controls or error cases, it's hard to tell if the gains come from better organization or just from how the digest is formatted. The concern about losing details on tricky pages is worth checking in the full text, especially if the semantic step uses LLM calls that could err. This is for people building or studying web agents. A reader looking for new ways to structure inputs for agents would get value from the framework. It has enough of an idea to merit peer review, even if the experiments need tightening with ablations and more analysis.

Referee Report

3 major / 2 minor

Summary. The paper argues that web agents should perceive pages at the granularity of functional regions rather than element-level signals from the AXTree. It introduces Region4Web, which performs hierarchical decomposition and semantic abstraction to expose functional organization, and implements this via the PageDigest pipeline that supplies a compact, persistent per-page digest to the actor. On WebArena, the authors report that PageDigest reduces observation length while raising task success rates across multiple LLMs and agent frameworks, independent of backbone capacity.

Significance. If the empirical results are robust, the work supplies concrete evidence that observation-space granularity is an under-optimized design axis for web agents. By making the page’s functional organization explicit rather than implicit, the approach could improve both efficiency and reliability of LLM-based web navigation, with potential transfer to other structured environments that possess accessibility trees.

major comments (3)

[Experimental Evaluation] Experimental section: the abstract states clear gains in success rate and observation length, yet supplies no information on the number of evaluation runs, statistical significance tests, variance across seeds, or the precise baseline implementations (including how element-level AXTree observations were formatted and tokenized for the same agent scaffolds). Without these controls the data-to-claim link cannot be verified.
[Region4Web Framework] Method (hierarchical decomposition and semantic abstraction): the central claim requires that the region-level digest is strictly more informative than raw element-level signals. The manuscript does not describe safeguards against loss of task-critical details (labels, states, inter-element relationships) when the LLM-based semantic inference step misclassifies or elides elements on JavaScript-heavy or non-standard WebArena pages.
[Results] Results analysis: the claim that gains hold “regardless of backbone capacity” is load-bearing for the generality argument, yet no capacity-stratified breakdown, ablation on model scale, or per-task error analysis is provided to support it.

minor comments (2)

[Abstract] The abstract introduces “PageDigest” and “functional regions” without a one-sentence definition; a brief parenthetical gloss would aid readers.
[Method] Notation for the AXTree regions and the digest format should be introduced once with a small illustrative figure or table early in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps improve the clarity and rigor of our work on observation granularity for web agents. We address each major comment below and have updated the manuscript to incorporate the suggested details and analyses.

Referee: [Experimental Evaluation] Experimental section: the abstract states clear gains in success rate and observation length, yet supplies no information on the number of evaluation runs, statistical significance tests, variance across seeds, or the precise baseline implementations (including how element-level AXTree observations were formatted and tokenized for the same agent scaffolds). Without these controls the data-to-claim link cannot be verified.

Authors: We agree these experimental controls are necessary for verifying the claims. In the revised manuscript, we have added a new subsection under Experiments detailing the protocol: all results are averaged over 5 independent runs with distinct random seeds, reporting mean success rates and observation lengths along with standard deviations. We include paired t-tests confirming statistical significance (p < 0.05) for the reported gains. We also specify the exact formatting and tokenization of element-level AXTree baselines to match the agent scaffolds used with PageDigest, ensuring direct comparability. revision: yes
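
A paired t-test of the kind named in this response can be sketched with the Python standard library alone. This is an illustration of the general technique, not the authors' evaluation code. `paired_t_statistic` is a hypothetical helper, and the per-seed values below are placeholders chosen for the example, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed success rates, NOT taken from the paper.
with_digest = [0.42, 0.44, 0.41, 0.45, 0.43]
baseline = [0.38, 0.41, 0.37, 0.40, 0.39]
t, df = paired_t_statistic(with_digest, baseline)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

With 5 runs there are 4 degrees of freedom, for which the two-sided 5% critical value of the t distribution is about 2.776. The pairing only makes sense if run i of each condition shares the same seed and task set.
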
Referee: [Region4Web Framework] Method (hierarchical decomposition and semantic abstraction): the central claim requires that the region-level digest is strictly more informative than raw element-level signals. The manuscript does not describe safeguards against loss of task-critical details (labels, states, inter-element relationships) when the LLM-based semantic inference step misclassifies or elides elements on JavaScript-heavy or non-standard WebArena pages.

Authors: This is a valid concern for the semantic abstraction step. The revised Method section now explicitly describes safeguards: the hierarchical decomposition preserves the complete AXTree (including all labels, states, and relationships) prior to abstraction; the PageDigest pipeline includes a conservative prompting strategy that prioritizes retention of task-critical attributes; and we added an analysis of potential misclassifications on JavaScript-heavy WebArena pages, showing that critical elements are rarely elided and providing concrete examples of how the digest handles non-standard structures without information loss for navigation tasks. revision: yes
Referee: [Results] Results analysis: the claim that gains hold “regardless of backbone capacity” is load-bearing for the generality argument, yet no capacity-stratified breakdown, ablation on model scale, or per-task error analysis is provided to support it.

Authors: We acknowledge that the generality claim benefits from explicit stratification. The revised Results section includes a new table and figure providing capacity-stratified breakdowns across model scales (small: <10B, medium: 10-30B, large: >30B parameters) for the tested LLMs, demonstrating consistent improvements independent of backbone size. We also added a per-task error analysis categorizing failure modes (e.g., navigation vs. form-filling errors) and showing how region-level observations reduce specific error types across scales. revision: yes
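
The capacity-stratified breakdown described in this response amounts to bucketing each model by parameter count and computing success rates per bucket and condition. A minimal sketch follows, assuming the boundaries quoted above (small: <10B, medium: 10-30B, large: >30B). The names `capacity_bucket` and `stratified_success` and the record layout are hypothetical, and treating exactly 10B and exactly 30B as "medium" is an assumption about how the stated ranges close.

```python
from collections import defaultdict


def capacity_bucket(params_billions):
    """Map a parameter count (in billions) to the bucket names used in the response."""
    if params_billions < 10:
        return "small"
    if params_billions <= 30:  # assumption: both 10B and 30B count as medium
        return "medium"
    return "large"


def stratified_success(records):
    """records: iterable of (params_billions, condition, succeeded) tuples.

    Returns a dict mapping (bucket, condition) to the success rate in that cell.
    """
    totals = defaultdict(lambda: [0, 0])  # (bucket, condition) -> [successes, trials]
    for params, condition, succeeded in records:
        cell = totals[(capacity_bucket(params), condition)]
        cell[0] += int(succeeded)
        cell[1] += 1
    return {key: wins / trials for key, (wins, trials) in totals.items()}
```

Comparing the "digest" and "baseline" cells within each bucket, instead of pooling all models, is what would let a reader check the claim that the gains hold regardless of backbone capacity.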

Circularity Check

0 steps flagged

No circularity: empirical validation on external benchmark with no self-referential derivations

The paper introduces Region4Web and PageDigest as a framework for reorganizing AXTree into functional regions via hierarchical decomposition and semantic abstraction, then reports empirical gains in observation length and task success on the WebArena benchmark across LLMs and agent methods. No equations, fitted parameters, or first-principles derivations are claimed. The central result is an observed performance delta on an external benchmark, not a quantity reduced to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is self-contained as a proposal plus independent evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical performance of the proposed region-level observation pipeline; the abstract introduces no free parameters, domain axioms, or new postulated entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "PageDigest substantially reduces observation length while improving overall task success rate"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
