Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation
Pith reviewed 2026-05-10 11:16 UTC · model grok-4.3
The pith
GUI agents deliver help by directly editing live web interfaces through reversible DOM changes instead of separate chats.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes in-situ assistance: a mode of support delivered directly within any live web interface through lightweight, browser-level interventions on the Document Object Model, without rebuilding the application or modifying its underlying logic. A design space and a computational pipeline characterize how GUI agents can insert, mutate, or recompose web elements to make interfaces easier to understand and navigate. The approach is instantiated in a Chrome extension that grounds user requests to UI elements and executes reversible manipulations, including contextual tooltips, control highlighting, and layout reorganization.
What carries the argument
The computational pipeline for DOM-mediated in-situ assistance that interprets user help requests and live interface context, grounds them to relevant UI elements, and executes reversible manipulations.
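The grounding step above can be sketched in miniature. This is a hypothetical illustration, not the paper's actual pipeline: UI elements are reduced to label strings, and a simple token-overlap score stands in for whatever matching the real system performs.

```javascript
// Tokenize text into a set of lowercase word tokens.
function tokenize(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) || []);
}

// Jaccard-style overlap between the request and an element label.
function score(request, label) {
  const a = tokenize(request);
  const b = tokenize(label);
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / Math.max(1, new Set([...a, ...b]).size);
}

// Ground a help request to the highest-scoring UI element.
function groundRequest(request, elements) {
  let best = null;
  let bestScore = -1;
  for (const el of elements) {
    const s = score(request, el.label);
    if (s > bestScore) { bestScore = s; best = el; }
  }
  return best;
}

// Example: a request about exporting a chart grounds to the Export button.
const elements = [
  { id: "btn-export", label: "Export chart as PNG" },
  { id: "btn-zoom", label: "Zoom to selection" },
  { id: "menu-file", label: "File menu" },
];
const hit = groundRequest("how do I export this chart?", elements);
// → hit.id === "btn-export"
```

In a real extension the candidate list would come from scanning the live page, and the scoring would likely be model-based rather than lexical; the structure of the step (request in, target element out) is the point.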
If this is right
- In-situ assistance becomes deployable on arbitrary web interfaces without application-specific engineering.
- Users receive contextual help that integrates directly into the live view through element changes.
- GUI agents shift from sideline conversational support to active live interface reconfiguration.
- Quantitative results confirm reliable and efficient assistance delivery on complex visual interfaces.
Where Pith is reading between the lines
- The method could extend to other platforms if similar access to interface elements is feasible.
- Agents might combine request grounding with usage pattern detection to offer adjustments proactively.
- Widespread use would require safeguards for dynamic pages where structure changes rapidly.
Load-bearing premise
Lightweight reversible manipulations of page structure can be performed reliably across arbitrary web interfaces without breaking functionality.
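Reversibility here means every change records enough state to undo itself. A minimal sketch of a reversible "mutate" primitive, assuming elements are modeled as plain objects with an attribute map (in a browser this would be a real Element with getAttribute/setAttribute):

```javascript
// Apply attribute changes to an element, recording prior values so the
// whole manipulation can be rolled back. Returns an undo closure.
function applyMutation(element, attrs) {
  const undo = [];
  for (const [name, value] of Object.entries(attrs)) {
    undo.push({ name, prev: element.attributes[name] }); // prev may be undefined
    element.attributes[name] = value;
  }
  return function revert() {
    // Restore in reverse order; delete attributes that did not exist before.
    for (const { name, prev } of undo.reverse()) {
      if (prev === undefined) delete element.attributes[name];
      else element.attributes[name] = prev;
    }
  };
}

const el = { tag: "button", attributes: { class: "btn" } };
const revert = applyMutation(el, { class: "btn highlighted", "aria-label": "Export" });
// while assistance is shown: el.attributes.class === "btn highlighted"
revert();
// after dismissal: el.attributes is back to { class: "btn" }
```

The fragility the premise glosses over is visible even here: if the page's own framework rewrites `attributes` between apply and revert, the recorded `prev` values go stale, which is exactly the dynamic-page concern raised above.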
What would settle it
Applying the pipeline to a broad sample of popular web applications and observing frequent assistance failures or unintended breaks in original interface behavior.
read the original abstract
Complex visual interfaces are powerful yet have a steep learning curve, as users must navigate feature-rich visual interfaces while reasoning about domain-specific operations. Existing approaches either deliver assistance through a separate chat-based interaction, or require substantial application-specific engineering to build support natively into each interface. To address these gaps, we propose in-situ assistance: a mode of support delivered directly within any live web interface through lightweight, browser-level interventions on the Document Object Model (DOM), without rebuilding the application or modifying its underlying logic. We contribute a design space and a computational pipeline for DOM-mediated in-situ assistance, characterizing how GUI agents can insert, mutate, or recompose web elements to make the interface easier for users to understand, use, and navigate. We instantiate in-situ assistance in DOMSteer, a Chrome extension that interprets a user's help request and live interface context, grounds it to relevant UI elements, and executes reversible DOM manipulations directly on the live page to deliver assistance, including contextual tooltips, control highlighting, and layout reorganization. Quantitative evaluations on two complex visual interfaces show that DOMSteer delivers reliable and efficient in-situ assistance. Use cases and a comparative user study with a ChatGPT Atlas baseline demonstrate the usability and effectiveness of DOMSteer. Altogether, these findings point to a broader role for GUI agents: not just assisting from the sidelines, but actively reconfiguring live interfaces to support users in the moment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes in-situ assistance as a new mode of support for complex web interfaces, achieved through lightweight, reversible interventions on the live DOM by a GUI agent in the DOMSteer Chrome extension. It contributes a design space and pipeline for inserting, mutating, or recomposing UI elements to provide contextual help, highlighting, and layout changes without modifying the underlying application. Claims are supported by quantitative evaluations on two interfaces showing reliable assistance and a comparative user study demonstrating usability over chat-based baselines.
Significance. If the approach generalizes reliably, this work could meaningfully advance GUI agents and HCI by enabling agents to actively reconfigure live interfaces in the moment rather than relying on separate chat or per-app engineering. The design space and pipeline for DOM-mediated transformations represent a concrete step toward more embedded agent assistance.
major comments (1)
- [Abstract and quantitative evaluations] The central claim that lightweight, reversible DOM manipulations (insert, mutate, recompose) can be performed reliably on arbitrary live web interfaces without breaking functionality or requiring app-specific engineering (Abstract) is load-bearing for the contribution. However, quantitative evaluations are reported on only two complex interfaces; no evidence is given that the grounding pipeline or manipulation primitives handle common cases such as virtual DOMs (React/Vue), shadow DOMs, heavy event delegation, or client-side state that can invalidate direct edits even when intended to be reversible.
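The shadow DOM concern is concrete: a plain `querySelectorAll` does not descend into shadow roots, so a grounding pass must recurse into them explicitly. A sketch, with nodes as plain objects standing in for Elements (in a browser, `node.shadowRoot` exposes an open shadow root; closed roots remain inaccessible to a content script):

```javascript
// Collect every labeled node, descending into children and into any
// (open) shadow root attached to a node.
function collectLabeled(node, out = []) {
  if (node.label) out.push(node);
  for (const child of node.children || []) collectLabeled(child, out);
  if (node.shadowRoot) collectLabeled(node.shadowRoot, out);
  return out;
}

const page = {
  children: [
    { label: "Save" },
    {
      // A web component hiding its controls inside a shadow root:
      // invisible to a naive top-level query.
      shadowRoot: { children: [{ label: "Export" }, { label: "Share" }] },
    },
  ],
};
const labels = collectLabeled(page).map((n) => n.label);
// → ["Save", "Export", "Share"]
```

Whether the paper's grounding pipeline performs this traversal is exactly what the evaluation on two interfaces leaves open.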
minor comments (1)
- [Abstract] The abstract refers to 'two complex visual interfaces' without naming them; adding the specific interfaces and a brief characterization in the evaluation section would aid reader understanding.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the concern about the scope of our evaluations and the generalizability of the DOM manipulation pipeline by clarifying its reliance on standard web APIs and committing to expanded discussion of limitations.
read point-by-point responses
-
Referee: [Abstract and quantitative evaluations] The central claim that lightweight, reversible DOM manipulations (insert, mutate, recompose) can be performed reliably on arbitrary live web interfaces without breaking functionality or requiring app-specific engineering (Abstract) is load-bearing for the contribution. However, quantitative evaluations are reported on only two complex interfaces; no evidence is given that the grounding pipeline or manipulation primitives handle common cases such as virtual DOMs (React/Vue), shadow DOMs, heavy event delegation, or client-side state that can invalidate direct edits even when intended to be reversible.
Authors: We appreciate the referee highlighting this important point. The claim of broad applicability is indeed central to the contribution. Our quantitative evaluations were performed on two complex interfaces (a feature-rich dashboard and a collaborative productivity tool), as reported in Sections 5 and 6, demonstrating reliable performance in those cases. The pipeline and primitives are intentionally built on standard browser DOM APIs (query selectors, element creation/mutation, and event preservation), which operate on the rendered live DOM after any framework-specific rendering occurs. Virtual DOM approaches (React/Vue) ultimately expose standard HTML elements, so post-render manipulations apply without app-specific engineering. Shadow DOM encapsulation can be traversed using standard extension APIs when the extension runs in the appropriate context. Heavy event delegation is addressed by preserving original listeners during insert/mutate/recompose operations and relying on reversible snapshots. Client-side state changes are mitigated via mutation observers and full DOM restoration on dismissal to ensure reversibility. We acknowledge, however, that these mechanisms were not exhaustively validated across every possible edge case in the current evaluations. In the revised version, we will add a dedicated 'Limitations and Future Extensions' subsection in the Discussion that explicitly discusses these scenarios, potential failure modes, and how the design space could be extended (e.g., framework-aware grounding). We will also moderate the abstract language from 'arbitrary live web interfaces' to 'a wide range of live web interfaces' to better reflect the evaluated scope. This revision strengthens the paper without altering the core technical contribution.
revision: partial
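The rebuttal's claim about preserving listeners during recompose can be illustrated simply: moving the same node object, rather than cloning it, keeps attached listeners intact, which matches how `appendChild` re-parents an element in a browser. A sketch with plain objects (`listeners` stands in for addEventListener state, which this review has no visibility into for the real system):

```javascript
// Move a node between parents and return an undo closure that restores
// the original position. The node object itself is never copied, so any
// state attached to it (listeners, framework refs) survives the move.
function moveNode(node, fromParent, toParent) {
  const i = fromParent.children.indexOf(node);
  if (i === -1) throw new Error("node not under fromParent");
  fromParent.children.splice(i, 1);
  toParent.children.push(node);
  return () => {
    toParent.children.splice(toParent.children.indexOf(node), 1);
    fromParent.children.splice(i, 0, node);
  };
}

const button = { label: "Export", listeners: ["click:doExport"] };
const toolbar = { children: [button] };
const sidebar = { children: [] };
const undoMove = moveNode(button, toolbar, sidebar);
// sidebar.children[0] is the same object; its listeners are untouched
undoMove();
// toolbar.children[0] === button again
```

What this sketch cannot capture is the rebuttal's harder case: a framework that re-renders the subtree and replaces the moved node with a fresh one, invalidating the undo closure's reference.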
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper is a systems/HCI contribution that proposes in-situ assistance via lightweight DOM interventions, describes a design space and computational pipeline, instantiates it in DOMSteer, and supports the claims through quantitative evaluation on two interfaces plus a comparative user study. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the abstract or description. Claims rest on system implementation details and independent empirical results rather than reducing by construction to inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Browser extensions can access and modify the DOM of live web pages
- domain assumption GUI agents can interpret user help requests and ground them to relevant UI elements
invented entities (2)
- in-situ assistance: no independent evidence
- DOMSteer: no independent evidence