pith. machine review for the scientific record.

arxiv: 2604.04918 · v1 · submitted 2026-04-06 · 💻 cs.HC

Recognition: no theorem link

Comparing Human Oversight Strategies for Computer-Use Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:44 UTC · model grok-4.3

classification 💻 cs.HC
keywords computer-use agents · human oversight · delegation strategies · user study · LLM agents · intervention · trust · problematic actions

The pith

Oversight strategy shapes exposure to problematic actions more than the ability to correct them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different strategies for overseeing LLM-powered computer-use agents affect users' ability to prevent and address issues during task execution. It frames oversight as a coordination problem defined by how much control is delegated to the agent and how engaged the human stays. A study with 48 participants on live web tasks compared four strategies and found that strategy choice more reliably shaped how often problematic actions appeared than it improved users' success at stepping in to fix them once visible. Plan-based strategies in particular cut down on bad actions occurring up front. This matters because agents are shifting users into supervisory roles, so effective oversight requires making key decision moments visible and recognizable rather than simply increasing human involvement.

Core claim

We conceptualize CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and use this lens to compare four oversight strategies in a mixed-methods study with 48 participants in a live web environment. Our results show that oversight strategy more reliably shaped users' exposure to problematic actions than their ability to correct them once visible. Plan-based strategies were associated with lower rates of agent problematic-action occurrence, but not equally strong gains in runtime intervention success once such actions became visible. Effective CUA oversight is not achieved by maximizing human involvement alone. Instead, it depends on how supervision is structured to surface decision-critical moments and support their recognition in time for meaningful intervention.

What carries the argument

The structural lens of delegation structure and engagement level, which organizes comparison of oversight strategies to separate effects on exposure to problematic actions from effects on runtime correction success.
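The dissociation this lens enforces can be made concrete: exposure and correction are rates with different denominators. A minimal sketch (hypothetical event logs and field names, not the paper's actual coding scheme):

```python
# Hypothetical per-trial event log for one oversight condition.
# Field names are illustrative, not the paper's coding rubric.
trials = [
    {"attempted": 2, "visible": 2, "corrected": 1},
    {"attempted": 0, "visible": 0, "corrected": 0},
    {"attempted": 1, "visible": 1, "corrected": 0},
]

def exposure_rate(trials):
    """Problematic actions attempted per trial: what strategy choice shifted."""
    return sum(t["attempted"] for t in trials) / len(trials)

def correction_rate(trials):
    """Share of *visible* problematic actions the user fixed.
    Note the denominator: only visible events count, which is why
    low-exposure conditions leave few events to estimate this from."""
    visible = sum(t["visible"] for t in trials)
    corrected = sum(t["corrected"] for t in trials)
    return corrected / visible if visible else None

print(exposure_rate(trials))            # 1.0
print(round(correction_rate(trials), 2))  # 0.33
```

A strategy can lower `exposure_rate` without moving `correction_rate` at all, which is the paper's headline pattern.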

Load-bearing premise

The four tested oversight strategies adequately represent distinct delegation structures and engagement levels, and the live web environment and chosen tasks are representative of typical real-world computer-use agent scenarios.

What would settle it

A replication study that finds no reliable difference in rates of problematic action exposure across the four strategies would show that oversight structure does not shape exposure more than correction ability.

Figures

Figures reproduced from arXiv: 2604.04918 by Chaoran Chen, Eryue Xu, Ibrahim Khalilov, Simret A Gebreegziabher, Tianshi Li, Toby Jia-Jun Li, Yanfang Ye, Yaxing Yao, Yinuo Yang, Zeya Chen, Zhiping Zhang, Ziang Xiao.

Figure 1: Design space of CUA oversight strategies defined [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2: Interface instantiations of the four oversight strategies used in our study. (1) Risk-Gated: (a) agent focus and reasoning; [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3: Loan pre-qualification task with embedded privacy [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4: Flight booking task with embedded privacy leakage [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5: Benefits application task with embedded privacy [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6: Food ordering task with embedded prompt injection [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7: Ticket purchase task with embedded dark pattern. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8: Review task with embedded prompt injection lead [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9: Distribution of item-level subjective ratings by task context and oversight strategy. Within each panel, stacked bars [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10: Base system prompt used across all oversight conditions. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
read the original abstract

LLM-powered computer-use agents (CUAs) are shifting users from direct manipulation to supervisory coordination. Existing oversight mechanisms, however, have largely been studied as isolated interface features, making broader oversight strategies difficult to compare. We conceptualize CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and use this lens to compare four oversight strategies in a mixed-methods study with 48 participants in a live web environment. Our results show that oversight strategy more reliably shaped users' exposure to problematic actions than their ability to correct them once visible. Plan-based strategies were associated with lower rates of agent problematic-action occurrence, but not equally strong gains in runtime intervention success once such actions became visible. On subjective measures, no single strategy was uniformly best, and the clearest context-sensitive differences appeared in trust. Qualitative findings further suggest that intervention depended not only on what controls users retained, but on whether risky moments became legible as requiring judgment during execution. These findings suggest that effective CUA oversight is not achieved by maximizing human involvement alone. Instead, it depends on how supervision is structured to surface decision-critical moments and support their recognition in time for meaningful intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a mixed-methods study with 48 participants comparing four oversight strategies for LLM-powered computer-use agents (CUAs) in a live web environment. Oversight is framed as a coordination problem defined by delegation structure and engagement level. The central empirical claim is that strategy more reliably affects users' exposure to problematic agent actions (lower occurrence with plan-based approaches) than their ability to correct such actions once visible, with no uniformly superior strategy on subjective measures and qualitative emphasis on the legibility of risky moments for intervention.

Significance. If the dissociation between exposure and correction holds after addressing power and reporting issues, the work offers timely empirical guidance for designing human oversight of autonomous agents, moving beyond isolated UI features. The live-environment mixed-methods design and primary data collection are strengths that enhance relevance to real-world CUA use. The conceptual lens on delegation and engagement provides a useful organizing framework, and the finding that maximizing involvement is not sufficient is a constructive contribution to the field.

major comments (2)
  1. [§4 (Results, intervention success analysis)] The claim that plan-based strategies do not produce equally strong gains in runtime intervention success is load-bearing for the headline dissociation result, yet it rests on comparisons where the denominator (visible problematic actions) is smaller by construction in the lower-occurrence conditions. With only 48 participants across four strategies, per-condition event counts may fall below 10–15, rendering the absence of detectable differences difficult to distinguish from low power or sampling variability rather than a true effect.
  2. [§3 (Method)] The operational definitions of 'problematic actions,' intervention success criteria, exclusion rules, and how occurrence rates were coded are not detailed enough in the abstract and appear underspecified for reproducibility; without these, the reliability of the exposure-versus-correction comparison cannot be fully assessed.
minor comments (2)
  1. [Abstract] Include at least one key statistical result, effect size, or confidence interval supporting the 'more reliably shaped' and 'not equally strong gains' claims.
  2. [Figures and tables] Add error bars, sample sizes per cell, and exact p-values or test statistics to all comparisons of occurrence and intervention rates.
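The power concern behind the first major comment can be checked with back-of-envelope interval widths. A minimal stdlib sketch (illustrative counts, not study data): with roughly a dozen visible events per condition, a 95% Wilson score interval on an intervention-success proportion spans much of [0, 1], so failing to detect a difference is weakly informative:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Illustrative: 12 visible problematic actions, 7 corrected.
lo, hi = wilson_interval(7, 12)
print(f"{lo:.2f}-{hi:.2f}")  # 0.32-0.81: too wide to rank strategies
```

At these event counts the interval covers both "usually corrected" and "usually missed", which is the referee's point about distinguishing a null effect from low power.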

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity, reproducibility, and the interpretation of our results.

read point-by-point responses
  1. Referee: [§4 (Results, intervention success analysis)] The claim that plan-based strategies do not produce equally strong gains in runtime intervention success is load-bearing for the headline dissociation result, yet it rests on comparisons where the denominator (visible problematic actions) is smaller by construction in the lower-occurrence conditions. With only 48 participants across four strategies, per-condition event counts may fall below 10–15, rendering the absence of detectable differences difficult to distinguish from low power or sampling variability rather than a true effect.

    Authors: We agree that the smaller denominators in plan-based conditions reduce statistical power for detecting differences in intervention success rates, and that modest per-condition event counts (potentially below 10–15) make it difficult to distinguish absence of effect from sampling variability. In the revision we will (a) report exact counts of visible problematic actions per condition, (b) add confidence intervals around intervention success proportions, and (c) include an explicit discussion of power limitations with a post-hoc power calculation. We will also qualify the interpretation by noting that the primary dissociation is anchored in the statistically robust differences in occurrence rates (which rest on larger event pools), while the intervention-success comparisons should be viewed as exploratory. These changes will strengthen rather than alter the headline claim. revision: partial

  2. Referee: [§3 (Method)] The operational definitions of 'problematic actions,' intervention success criteria, exclusion rules, and how occurrence rates were coded are not detailed enough in the abstract and appear underspecified for reproducibility; without these, the reliability of the exposure-versus-correction comparison cannot be fully assessed.

    Authors: We accept that the current description of coding procedures is insufficient for full reproducibility. We will expand §3 with precise operational definitions: (1) the full coding rubric for problematic actions, including concrete examples drawn from the study tasks and decision rules for borderline cases; (2) the exact criteria used to classify an intervention as successful (e.g., action prevented, corrected, or agent paused); (3) all exclusion rules applied to participants or trials; and (4) the step-by-step protocol for computing occurrence rates, including any inter-rater reliability statistics. These additions will allow readers to evaluate the exposure-versus-correction comparison directly. revision: yes
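The inter-rater reliability statistic promised in point (4) is commonly Cohen's kappa. A minimal stdlib sketch of how it would be computed over two coders' problematic-action labels (hypothetical labels, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two coders, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each coder's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical coding of 10 agent actions as problematic (1) or not (0).
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

Reporting kappa alongside the rubric would let readers judge whether "problematic action" was coded consistently enough to support the occurrence-rate comparisons.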

Circularity Check

0 steps flagged

No circularity: primary empirical user study with independent data collection

full rationale

The paper reports results from a mixed-methods experiment with 48 new participants performing tasks in a live web environment. It contains no equations, fitted parameters, predictions, or derivations that reduce to prior inputs by construction. The central claims rest on direct observation of occurrence rates and intervention success across four oversight strategies, plus qualitative analysis. No self-citation chains or ansatzes are invoked to justify the outcome measures. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical mixed-methods user study; the central claims rest on experimental design, participant responses, and qualitative coding rather than mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5542 in / 1228 out tokens · 71819 ms · 2026-05-10T19:44:10.134689+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

    cs.AI 2026-04 unverdicted novelty 5.0

    General-purpose coding agents achieve highest success on SciVis tasks but at high cost, while domain-specific agents are efficient yet less flexible and computer-use agents struggle with long workflows.

  2. Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

    cs.AI 2026-04 unverdicted novelty 5.0

    General-purpose coding agents achieve highest success on SciVis tasks but cost more compute, while domain-specific agents are efficient yet less flexible and computer-use agents falter on long workflows.

Reference graph

Works this paper leans on

58 extracted references · 33 canonical work pages · cited by 1 Pith paper

[1] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). Association for Computing Machinery, New York, NY, USA.
[2] Lisanne Bainbridge. 1983. Ironies of automation. Automatica 19, 6 (1983), 775–779. doi:10.1016/0005-1098(83)90046-8
[3] John Brooke et al. 1996. SUS: A quick and dirty usability scale. Usability Evaluation in Industry 189, 194 (1996), 4–7.
[4] Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, and Toby Jia-Jun Li. 2025. The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections. arXiv:2504.11281 [cs.HC]. https://arxiv.org/abs/2504.11281
[5–6] Chaoran Chen, Daodao Zhou, Yanfang Ye, Toby Jia-Jun Li, and Yaxing Yao. CLEAR: Towards Contextual LLM-Empowered Privacy Policy Analysis and Risk Generation for Large Language Model Applications. In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). Association for Computing Machinery, New York, NY, USA, 277–297. doi:10.1145/3708359.3712156
[7] Victoria Clarke and Virginia Braun. 2017. Thematic analysis. The Journal of Positive Psychology 12, 3 (2017), 297–298. doi:10.1080/17439760.2016.1262613
[8] Claude Code. 2026. Use Claude Code with Chrome (beta). https://code.claude.com/docs/en/chrome. Accessed: 2026-03-26.
[9] Phil Cuvin, Hao Zhu, and Diyi Yang. 2026. DECEPTICON: How Dark Patterns Manipulate Web Agents. arXiv:2512.22894 [cs.CR]. https://arxiv.org/abs/2512.22894
[10] Mary T. Dzindolet, Scott A. Peterson, Regina A. Pomranky, Linda G. Pierce, and Hall P. Beck. 2003. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud. 58, 6 (June 2003), 697–718. doi:10.1016/S1071-5819(03)00038-7
[11] Mica R. Endsley and Esin O. Kiris. 1995. The Out-of-the-Loop Performance Problem and Level of Control in Automation. Human Factors 37, 2 (1995), 381–394. doi:10.1518/001872095779064555
[12–13] Cedric Faas, Sophie Kerstan, Richard Uth, Markus Langer, and Anna Maria Feit. Design Considerations for Human Oversight of AI: Insights from Co-Design Workshops and Work Design Theory. In Proceedings of the 31st International Conference on Intelligent User Interfaces (IUI '26). Association for Computing Machinery, New York, NY, USA, 804–821. doi:10.1145/3742413.3789100
[14] K. J. Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S. Weld, Amy X. Zhang, and Joseph Chee Chang. 2026. Cocoa: Co-Planning and Co-Execution with AI Agents. arXiv:2412.10999 [cs.HC]. https://arxiv.org/abs/2412.10999
[15] Google. 2026. Gemini 3.1 Flash-Lite: Built for intelligence at scale. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/. Accessed: 2026-03-26.
[16–17] Colin M. Gray, Cristiana Teixeira Santos, Nataliia Bielova, and Thomas Mildner. An Ontology of Dark Patterns Knowledge: Foundations, Definitions, and a Pathway for Shared Knowledge-Building. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 289, 22 pages. doi:10.1145/3613904.3642436
[18] Madeleine Grunde-McLaughlin, Hussein Mozannar, Maya Murad, Jingya Chen, Saleema Amershi, and Adam Fourney. 2026. Overseeing Agents Without Constant Oversight: Challenges and Opportunities. arXiv:2602.16844 [cs.HC]. https://arxiv.org/abs/2602.16844
[19] Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25). Association for Computing Machinery, New York, NY, USA, Article 414, 22 pages.
[20] Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030
[21] Gary Klein, Brian Moon, and Robert R. Hoffman. 2006. Making Sense of Sensemaking 1: Alternative Perspectives. IEEE Intelligent Systems 21, 4 (July 2006), 70–73. doi:10.1109/MIS.2006.75
[22] Gary A. Klein. 1998. Sources of Power: How People Make Decisions. MIT Press.
[23] Johann Laux and Hannah Ruschemeier. 2025. Automation Bias in the AI Act: On the Legal Implications of Attempting to De-Bias Human Oversight of AI. European Journal of Risk Regulation 16, 4 (2025), 1519–1534. doi:10.1017/err.2025.10033
[24] Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. 2025. EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=xMOLUzo2Lk
[25] Arien Mack. 2003. Inattentional Blindness: Looking Without Seeing. Current Directions in Psychological Science 12, 5 (2003), 180–184. doi:10.1111/1467-8721.01256
[26] Salama A. Mostafa, Mohd Sharifuddin Ahmad, and Aida Mustapha. 2019. Adjustable autonomy: a systematic literature review. Artif. Intell. Rev. 51, 2 (Feb. 2019), 149–186. doi:10.1007/s10462-017-9560-8
[27] Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, and Saleema Amershi. 2025. Magentic-UI: Towards Human-in-the-loop Age...
[28] Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zho...
[29] OpenAI. 2025. Introducing Operator: Safety and privacy. https://openai.com/index/introducing-operator/. Accessed: 2025-01-19.
[30–31] Raja Parasuraman and Victor Riley. 1997. Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors 39, 2 (1997), 230–. doi:10.1518/001872097778543886
[32] R. Parasuraman, T. B. Sheridan, and C. D. Wickens. 2000. A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 30, 3 (May 2000), 286–297. doi:10.1109/3468.844354
[33] Ronald A. Rensink. 2002. Change detection. Annual Review of Psychology 53, 1 (2002), 245–277.
[34] Susana Rubio, Eva Díaz, Jesús Martín, and José M. Puente. 2004. Evaluation of Subjective Mental Workload: A Comparison of SWAT, NASA-TLX, and Workload Profile Methods. Applied Psychology 53, 1 (2004), 61–86. doi:10.1111/j.1464-0597.2004.00161.x
[35] Xiaoxiao Song, Huimin Gu, Yunpeng Li, Xi Y. Leung, and Xiaodie Ling. 2024. The influence of robot anthropomorphism and perceived intelligence on hotel guests' continuance usage intention. Information Technology & Tourism 26, 1 (2024), 89–117. doi:10.1007/s40558-023-00275-8
[36] Sarah Sterz, Kevin Baum, Sebastian Biewer, Holger Hermanns, Anne Lauber-Rönsberg, Philip Meinel, and Markus Langer. 2024. On the Quest for Effectiveness in Human Oversight: Interdisciplinary Perspectives. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Association for Computing Ma...
[37] Jingyu Tang, Chaoran Chen, Jiawen Li, Zhiping Zhang, Bingcan Guo, Ibrahim Khalilov, Simret Araya Gebreegziabher, Bingsheng Yao, Dakuo Wang, Yanfang Ye, Tianshi Li, Ziang Xiao, Yaxing Yao, and Toby Jia-Jun Li. 2025. Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight. arXiv:2509.10723 [cs.HC].
[38] Nenad Tomašev, Matija Franklin, and Simon Osindero. 2026. Intelligent AI Delegation. arXiv:2602.11865 [cs.AI]. https://arxiv.org/abs/2602.11865
[39] Yinuo Yang, Ashley Ge Zhang, Steve Oney, and April Yi Wang. 2025. Spark: Real-Time Monitoring of Multi-Faceted Programming Exercises. In 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 81–92. doi:10.1109/VL-HCC65237.2025.00018
[40] Bingsheng Yao, Chaoran Chen, April Yi Wang, Sherry Tongshuang Wu, Toby Jia-Jun Li, and Dakuo Wang. 2026. From Human-Human Collaboration to Human-Agent Collaboration: A Vision, Design Philosophy, and an Empirical Framework for Achieving Successful Partnerships Between Humans and LLM Agents. arXiv:2602.05987 [cs.HC]. https://arxiv.org/abs/2602.05987
[41] Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-Jun Li, and Dakuo Wang. 2026. Through the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration. arXiv:2509.18008 [cs.HC]. https://arxiv.org/abs/2509.18008
[42] Shuning Zhang, Jingruo Chen, Zhiqi Gao, Jiajing Gao, Xin Yi, and Hewu Li. 2025. Characterizing Unintended Consequences in Human-GUI Agent Collaboration for Web Browsing. arXiv:2505.09875 [cs.HC]. https://arxiv.org/abs/2505.09875
[43] Shuning Zhang, Yutong Jiang, Rongjun Ma, Yuting Yang, Mingyao Xu, Zhixin Huang, Xin Yi, and Hewu Li. 2025. PrivWeb: Unobtrusive and Content-aware Privacy Protection For Web Agents. arXiv:2509.11939 [cs.HC]. https://arxiv.org/abs/2509.11939
[44] Zhiping Zhang, Bingcan Guo, and Tianshi Li. 2025. Privacy Leakage Overshadowed by Views of AI: A Study on Human Oversight of Privacy in Language Model Agent. arXiv:2411.01344 [cs.HC]. https://arxiv.org/abs/2411.01344
