pith. machine review for the scientific record. sign in

arxiv: 2510.05307 · v3 · submitted 2025-10-06 · 💻 cs.HC

When Should Users Check? Modeling Confirmation Frequency inMulti-Step Agentic AI Tasks

Pith reviewed 2026-05-18 08:56 UTC · model grok-4.3

classification 💻 cs.HC
keywords AI agentsuser confirmationmulti-step taskserror recoverydecision-theoretic modelhuman-AI interactionconfirmation schedulingagentic workflows
0
0 comments X

The pith

A decision-theoretic model for scheduling user confirmations in multi-step AI tasks reduces completion time by 13.54 percent and earns preference from 81 percent of users over end-only checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to determine optimal points for users to confirm progress in AI agents that handle multi-step tasks. Current systems either wait until the end, allowing early errors to force full restarts, or require checks at every step, which adds unnecessary effort. The authors first identified a recurring pattern in how people detect and fix mistakes, then built a model that treats confirmation placement as a scheduling problem aimed at minimizing total time. Evaluation with participants showed the resulting intermediate checks produced faster task finishes and higher satisfaction than the confirm-at-end baseline. If the approach holds, AI agents could support complex work with fewer wasted efforts while keeping users involved at the right moments.

Core claim

The paper establishes that confirmation frequency in agentic AI can be optimized by modeling it as a minimum-time scheduling problem grounded in the Confirmation-Diagnosis-Correction-Redo pattern observed in user monitoring behavior. Formative observations with a small group revealed how users typically verify steps, diagnose issues, correct them, and redo segments when needed. This pattern informed a decision-theoretic model that selects confirmation points to balance the cost of interruptions against the cost of rollbacks. A larger within-subjects study then demonstrated that this model-guided placement yields shorter overall task times and stronger user preference compared to confirming a

What carries the argument

The Confirmation-Diagnosis-Correction-Redo (CDCR) pattern, which captures the sequence users follow when monitoring and repairing AI errors, serves as the foundation for the decision-theoretic model that schedules confirmation points to minimize total execution plus recovery time.

If this is right

  • Task completion time decreases when confirmations occur at model-selected intermediate points rather than only at the end.
  • Most users prefer the intermediate confirmation strategy over the confirm-at-end method used in current systems.
  • AI agents become less prone to complete restarts from single early errors while avoiding the overhead of constant checks.
  • The scheduling model offers a concrete way to trade off interruption costs against rollback costs in agent workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scheduling logic could be applied to other human oversight scenarios such as data analysis pipelines or code generation sessions.
  • AI system builders might embed similar models to suggest or automate check points dynamically based on task structure.
  • Over time, repeated use of the pattern could reveal broader rules for when human intervention pays off in longer agent runs.

Load-bearing premise

The Confirmation-Diagnosis-Correction-Redo pattern observed with eight participants generalizes to the tasks and participants used in the larger evaluation, enabling the model to correctly predict efficient confirmation points.

What would settle it

A new study using different multi-step tasks that measures whether the model still produces measurable time savings and user preference over confirm-at-end would directly test the claim.

Figures

Figures reproduced from arXiv: 2510.05307 by Aryan Roy, Christopher J. MacLellan, Daniel Weitekamp, Jieyu Zhou, Sneh Gupta.

Figure 1
Figure 1. Figure 1: Confirmation–Diagnosis–Correction–Redo (CDCR) Pattern. Based on the expected time of user interaction under the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Representative Task Types and AI Agent Interface Example [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example of the interval between the last verified state and the next confirmation point. The colors denote the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Dynamic programming algorithm and results for an agent action sequence of length [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example of the simulated shopping task interface with intermediate confirmation, the activity list on the left and the [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Average completion time under different confirmation strategies. Left: Intermediate confirmation reduced total time [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Participant Likert scale scores for the questions asked in our post-study survey (Left).Preferred confirmation [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Existing AI agents typically execute multi-step tasks autonomously and only allow user confirmation at the end. During execution, users have little control, making the confirm-at-end approach brittle: a single error can cascade and force a complete restart. Confirming every step avoids such failures, but imposes tedious overhead. Balancing excessive interruptions against costly rollbacks remains an open challenge. We address this problem by modeling confirmation as a minimum time scheduling problem. We conducted a formative study with eight participants, which revealed a recurring Confirmation-Diagnosis-Correction-Redo (CDCR) pattern in how users monitor errors. Based on this pattern, we developed a decision-theoretic model to determine time-efficient confirmation point placement. We then evaluated our approach using a within-subjects study where 48 participants monitored AI agents and repaired their mistakes while executing tasks. Results show that 81 percent of participants preferred our intermediate confirmation approach over the confirm-at-end approach used by existing systems, and task completion time was reduced by 13.54 percent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that modeling user confirmation in multi-step AI agent tasks as a minimum-time scheduling problem, informed by a Confirmation-Diagnosis-Correction-Redo (CDCR) pattern observed in a formative study with 8 participants, yields a decision-theoretic model that places intermediate confirmations more efficiently than the confirm-at-end baseline used by existing systems. In a within-subjects evaluation with 48 participants, this approach is preferred by 81% of users and reduces task completion time by 13.54%.

Significance. If the CDCR pattern generalizes beyond the small formative sample and the reported gains are attributable to the model's specific placement decisions rather than intermediate confirmations in general, the work could improve reliability and user agency in agentic AI systems by reducing costly rollbacks while avoiding excessive interruptions. The within-subjects design and concrete preference/time metrics provide a useful empirical anchor for future human-AI interaction research.

major comments (2)
  1. [Formative Study] Formative study: The CDCR pattern and its associated time costs are derived from only eight participants, yet the manuscript provides no data, logs, or analysis demonstrating that this pattern occurs at comparable rates or with similar costs in the 48-participant main study. Because the decision-theoretic model is parameterized directly from this pattern to select confirmation points, the absence of re-validation means the 81% preference and 13.54% time reduction cannot be confidently attributed to the model rather than to any form of intermediate confirmation.
  2. [Evaluation] Evaluation section: No details are supplied on the statistical tests supporting the time-reduction claim, the controls used for task difficulty across conditions, or how the free parameters (time costs for confirmation, diagnosis, correction, and redo) were estimated or fitted from the formative data. These omissions are load-bearing for the central quantitative claims.
minor comments (1)
  1. The abstract and methods could more explicitly state the exact tasks used, how confirmation points were implemented in the interface, and whether participants were aware of the different conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and have made revisions to strengthen the manuscript where the comments identify omissions or areas needing clarification.

read point-by-point responses
  1. Referee: [Formative Study] Formative study: The CDCR pattern and its associated time costs are derived from only eight participants, yet the manuscript provides no data, logs, or analysis demonstrating that this pattern occurs at comparable rates or with similar costs in the 48-participant main study. Because the decision-theoretic model is parameterized directly from this pattern to select confirmation points, the absence of re-validation means the 81% preference and 13.54% time reduction cannot be confidently attributed to the model rather than to any form of intermediate confirmation.

    Authors: The formative study was conducted to identify the recurring CDCR behavioral pattern through direct observation, which then informed the time-cost parameters in the decision-theoretic model. We acknowledge that the original manuscript did not include a quantitative re-validation of CDCR occurrence rates from the main-study logs. To address this concern, we have added an analysis section that reports the frequency of CDCR sequences observed in the 48-participant data and compares the empirical time costs to those used in the model. This addition helps demonstrate that the pattern generalizes and that the reported gains stem from the model's optimized placement decisions rather than intermediate confirmations in general. revision: yes

  2. Referee: [Evaluation] Evaluation section: No details are supplied on the statistical tests supporting the time-reduction claim, the controls used for task difficulty across conditions, or how the free parameters (time costs for confirmation, diagnosis, correction, and redo) were estimated or fitted from the formative data. These omissions are load-bearing for the central quantitative claims.

    Authors: We agree that these methodological details were insufficiently reported. In the revised manuscript we now specify: (1) the statistical tests (paired t-tests with reported t-statistics, p-values, and Cohen's d effect sizes for completion time, plus Wilcoxon signed-rank tests for preference data); (2) the controls for task difficulty, including selection of tasks with matched step counts and complexity, plus within-subjects counterbalancing of condition order; and (3) the parameter estimation procedure, in which mean durations for each CDCR phase were computed directly from timestamped observations in the formative study and used as fixed inputs to the scheduler. These additions make the quantitative claims fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in separate formative data and standard decision theory

full rationale

The paper's chain proceeds from an independent formative study (n=8) that identifies the CDCR pattern, applies standard decision-theoretic scheduling to place confirmations, and then evaluates the resulting approach in a separate within-subjects experiment (n=48) whose outcomes (preference and time metrics) are measured directly rather than derived by construction from the model's fitted parameters. No equations or claims reduce the central result to self-definition, renamed inputs, or load-bearing self-citations; the evaluation data and participant preferences constitute external evidence relative to the model's construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the generalizability of the CDCR pattern observed in a small formative study and on accurate estimation of time costs for user actions in the decision model; these are not derived from first principles but fitted or assumed from user data.

free parameters (1)
  • Time costs for confirmation, diagnosis, correction, and redo actions
    Estimated or measured from participant behavior to optimize confirmation placement in the scheduling model.
axioms (1)
  • domain assumption Users follow a recurring Confirmation-Diagnosis-Correction-Redo (CDCR) pattern when monitoring and repairing AI agent errors.
    Identified in the formative study with eight participants and used as the basis for the decision-theoretic model.

pith-pipeline@v0.9.0 · 5714 in / 1519 out tokens · 62046 ms · 2026-05-18T08:56:12.166344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

107 extracted references · 107 canonical work pages · 11 internal anchors

  1. [1]

    Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDer- mott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins Sri, Anthony Barrett, Dave Christianson, et al. 1998. Pddl—the planning domain definition language.Technical Report, Tech. Rep.(1998)

  2. [2]

    Layla AI. 2024. About me, Layla: AI Trip Planner and Travel Sidekick. https: //layla.ai/about. Accessed: July 2025

  3. [3]

    Pokee AI. 2024. Revolutionize Digital Productivity with AI-Powered Workflow Automation. https://pokee.ai/home. Accessed: July 2025

  4. [4]

    Lize Alberts, Ulrik Lyngs, and Max Van Kleek. 2024. Computers as bad social actors: Dark patterns and anti-patterns in interfaces that act socially.Proceedings of the ACM on Human-Computer Interaction8, CSCW1 (2024), 1–25

  5. [5]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schul- man, and Dan Mané. 2016. Concrete problems in AI safety.arXiv preprint arXiv:1606.06565(2016)

  6. [6]

    Anthropic. 2024. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-computer-use. Accessed: July 2025

  7. [7]

    Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L Glassman. 2024. Chainforge: A visual toolkit for prompt engineering and llm hypothesis testing. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

  8. [8]

    Zahra Ashktorab, Mohit Jain, Q Vera Liao, and Justin D Weisz. 2019. Resilient chatbots: Repair strategy preferences for conversational breakdowns. InPro- ceedings of the 2019 CHI conference on human factors in computing systems. 1–12

  9. [9]

    Rui Assis and Pedro Carmona Marques. 2021. A dynamic methodology for setting up inspection time intervals in conditional preventive maintenance. Applied Sciences11, 18 (2021), 8715

  10. [10]

    Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S Weld. 2024. Chal- lenges in human-agent communication.arXiv preprint arXiv:2412.10380(2024)

  11. [11]

    Richard Barlow and Larry Hunter. 1960. Optimum preventive maintenance policies.Operations research8, 1 (1960), 90–100

  12. [12]

    Hugh Beyer and Karen Holtzblatt. 1999. Contextual design.interactions6, 1 (1999), 32–42

  13. [13]

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al

  14. [14]

    Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264(2024)

  15. [15]

    2012.Thematic analysis.57–71

    Virginia Braun and Victoria Clarke. 2012.Thematic analysis.57–71

  16. [16]

    Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, et al. 2024. PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks.arXiv preprint arXiv:2411.00081(2024)

  17. [17]

    Andy Cockburn, Blaine Lewis, Philip Quinn, and Carl Gutwin. 2020. Framing effects influence interface feature decisions. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–11

  18. [18]

    Andy Cockburn, Philip Quinn, Carl Gutwin, Zhe Chen, and Pang Suwanaposee

  19. [19]

    InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems

    Probability weighting in interactive decisions: Evidence for overuse of bad assistance, underuse of good assistance. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–12

  20. [20]

    CrewAI. 2024. What is CrewAI? https://docs.crewai.com/en/introduction. Ac- cessed: July 2025

  21. [21]

    Anindya Das Antar, Somayeh Molaei, Yan-Ying Chen, Matthew L Lee, and Nikola Banovic. 2024. VIME: Visual Interactive Model Explorer for Identify- ing Capabilities and Limitations of Machine Learning Models for Sequential Decision-Making. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–21

  22. [22]

    Bram De Jonge and Philip A Scarf. 2020. A review on maintenance optimization. European journal of operational research285, 3 (2020), 805–824

  23. [23]

    Rommert Dekker, Ralph E Wildeman, and Frank A Van der Duyn Schouten. 1997. A review of multi-component maintenance models with economic dependence. Mathematical methods of operations research45, 3 (1997), 411–435

  24. [24]

    Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. 2024. Agent ai: Surveying the horizons of multimodal interaction.arXiv preprint arXiv:2401.03568(2024). When Should Users Check? A Decision-Theoretic Model of Confirmation Frequency in Multi-Step AI Agent Task...

  25. [25]

    Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, et al. 2023. Faith and fate: Limits of transformers on compositionality (2023).arXiv preprint arXiv:2305.18654(2023)

  26. [26]

    Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, and Saleema Amershi. 2025. Interactive debugging and steering of multi-agent ai systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–15

  27. [27]

    KJ Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S Weld, Amy X Zhang, and Joseph Chee Chang. 2024. Cocoa: Co-planning and co-execution with ai agents.arXiv preprint arXiv:2412.10999(2024)

  28. [28]

    Flowith. 2024. Introduction to Flow Mode. https://doc.flowith.io/flow-mode- general-mode/introduction-to-flow-mode. Accessed: July 2025

  29. [29]

    Genspark. 2024. Understanding Genspark.AI. https://genspark.ai/spark/ understanding-genspark-ai. Accessed: July 2025

  30. [30]

    Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI conference on Human Factors in Computing Systems. 159–166

  31. [31]

    Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. 2024. The dawn of gui agent: A preliminary case study with claude 3.5 computer use.arXiv preprint arXiv:2411.10323(2024)

  32. [32]

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey.arXiv preprint arXiv:2402.02716(2024)

  33. [33]

    What’s important here?

    Faria Huq, Jeffrey P Bigham, and Nikolas Martelaro. 2023. " What’s important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces.arXiv preprint arXiv:2312.06147(2023)

  34. [34]

    Deniz Iren and Semih Bilgen. 2014. Cost of quality in crowdsourcing.Human Computation1, 2 (2014)

  35. [35]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation.Comput. Surveys55, 12 (2023), 1–38

  36. [36]

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2024. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?arXiv preprint arXiv:2409.07703(2024)

  37. [37]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran- Johnson, et al. 2022. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

  38. [38]

    Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. 2024. Ai agents that matter.arXiv preprint arXiv:2407.01502(2024)

  39. [39]

    Majeed Kazemitabaar, Jack Williams, Ian Drosos, Tovi Grossman, Austin Zachary Henley, Carina Negreanu, and Advait Sarkar. 2024. Improving steering and verification in AI-assisted data analysis with interactive task decomposition. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–19

  40. [40]

    Kevin C Lam, Carl Gutwin, Madison Klarkowski, and Andy Cockburn. 2022. More Errors vs. Longer Commands: The Effects of Repetition and Reduced Expressiveness on Input Interpretation Error, Learning, and Effort. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–17

  41. [41]

    MacLellan

    Lane Lawley and Christopher J. MacLellan. 2024. VAL: Interactive Task Learning with GPT Dialog Parsing. InProceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). https://doi.org/10.1145/3613904.3641915

  42. [42]

    John D Lee and Katrina A See. 2004. Trust in automation: Designing for appro- priate reliance.Human factors46, 1 (2004), 50–80

  43. [43]

    Ariel Levy, Monica Agrawal, Arvind Satyanarayan, and David Sontag. 2021. Assessing the impact of automated suggestions on decision making: Domain experts mediate model errors but take less initiative. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–13

  44. [44]

    Amanda Li, Jason Wu, and Jeffrey P Bigham. 2023. Using LLMs to Customize the UI of Webpages. InAdjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–3

  45. [45]

    Toby Jia-Jun Li, Amos Azaria, and Brad A Myers. 2017. SUGILITE: creating multimodal smartphone automation by demonstration. InProceedings of the 2017 CHI conference on human factors in computing systems. 6038–6049

  46. [46]

    Toby Jia-Jun Li, Jingya Chen, Haijun Xia, Tom M Mitchell, and Brad A Myers

  47. [47]

    InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology

    Multi-modal repairs of conversational breakdowns in task-oriented dialogs. InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. 1094–1107

  48. [48]

    Henry Lieberman. 1997. Autonomous interface agents. InProceedings of the ACM SIGCHI Conference on Human factors in computing systems. 67–74

  49. [49]

    Jijia Liu, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, Yi Wu, and Yu Wang. 2023. Llm-powered hierarchical language agent for real-time human-ai coordination.arXiv preprint arXiv:2312.15224(2023)

  50. [50]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2023. Esti- mating the carbon footprint of bloom, a 176b parameter language model.Journal of machine learning research24, 253 (2023), 1–15

  51. [51]

    Reuben Luera, Ryan A Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao, et al. 2024. Survey of User Interface Design and Interaction Techniques in Generative AI Applications. arXiv preprint arXiv:2410.22370(2024), 1–42

  52. [52]

    Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. 2024. Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems37 (2024), 94871–94908

  53. [53]

    Pattie Maes. 1995. Agents that reduce work and information overload. In Readings in human–computer interaction. Elsevier, 811–821

  54. [54]

    Manus. 2024. Leave it to Manus: General AI Agent for Work and Life. https: //manus.im/home. Accessed: July 2025

  55. [55]

    2024.Browser Use: Enable AI to control your browser

    Magnus Müller and Gregor Žunič. 2024.Browser Use: Enable AI to control your browser. https://github.com/browser-use/browser-use

  56. [56]

    Dana S Nau, Tsz-Chiu Au, Okhtay Ilghami, Ugur Kuter, J William Murdock, Dan Wu, and Fusun Yaman. 2003. SHOP2: An HTN planning system.Journal of artificial intelligence research20 (2003), 379–404

  57. [57]

    OpenAI. 2025. Introducing ChatGPT agent: bridging research and action. https: //openai.com/index/introducing-chatgpt-agent. Accessed: July 2025

  58. [58]

    Vishal Pallagani, Bharath Chandra Muppasani, Kaushik Roy, Francesco Fabiano, Andrea Loreggia, Keerthiram Murugesan, Biplav Srivastava, Francesca Rossi, Lior Horesh, and Amit Sheth. 2024. On the prospects of incorporating large lan- guage models (llms) in automated planning and scheduling (aps). InProceedings of the International Conference on Automated Pl...

  59. [59]

    Raja Parasuraman and Dietrich H Manzey. 2010. Complacency and bias in human use of automation: An attentional integration.Human factors52, 3 (2010), 381–410

  60. [60]

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis- Miquel Munguia, Daniel Rothchild, David R So, Maud Texier, and Jeff Dean

  61. [61]

    The carbon footprint of machine learning training will plateau, then shrink.Computer55, 7 (2022), 18–28

  62. [62]

    David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training.arXiv preprint arXiv:2104.10350(2021)

  63. [63]

    Philip Quinn and Andy Cockburn. 2016. When bad feels good: Assistance failures and interface preferences. InProceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 4005–4010

  64. [64]

    Philip Quinn and Andy Cockburn. 2020. Loss aversion and preferences in interaction.Human–Computer Interaction35, 2 (2020), 143–190

  65. [65]

    rabbit research team. 2023. Learning human actions on computer applications. https://rabbit.tech/research

  66. [66]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems36 (2023), 68539–68551

  67. [67]

    Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. 2024. Col- laborative gym: A framework for enabling and evaluating human-agent collab- oration.arXiv preprint arXiv:2412.15701(2024)

  68. [68]

    Vikas Shivashankar, Ron Alford, Ugur Kuter, and Dana S Nau. 2013. Hierarchical goal networks and goal-driven autonomy: Going where ai planning meets goal reasoning. InGoal Reasoning: Papers from the ACS Workshop, Vol. 95

  69. [69]

    Ben Shneiderman. 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy.International Journal of Human–Computer Interaction36, 6 (2020), 495–504

  70. [70]

    Ben Shneiderman and Pattie Maes. 1997. Direct manipulation vs. interface agents.interactions4, 6 (1997), 42–61

  71. [71]

    Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. 2024. Chain of thoughtlessness? an analysis of cot in planning.Advances in Neural Information Processing Systems37 (2024), 29106–29141

  72. [72]

    Sangho Suh, Meng Chen, Bryan Min, Toby Jia-Jun Li, and Haijun Xia. 2024. Luminate: Structured generation and exploration of design space with large language models for human-ai co-creation. InProceedings of the 2024 CHI Con- ference on Human Factors in Computing Systems. 1–26

  73. [73]

    Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. 2023. Cognitive architectures for language agents.Transactions on Machine Learning Research(2023)

  74. [74]

    Sharareh Taghipour and Dragan Banjevic. 2012. Optimal inspection of a com- plex system subject to periodic and opportunistic inspections and preventive replacements.European Journal of Operational Research220, 3 (2012), 649–660

  75. [75]

    Mike Ten Wolde and Adel A Ghobbar. 2013. Optimizing inspection inter- vals—Reliability and availability in terms of a cost model: A case study on railway carriers.Reliability Engineering & System Safety114 (2013), 137–147

  76. [76]

    David Tennenhouse. 2000. Proactive computing.Commun. ACM43, 5 (2000), 43–50

  77. [77]

    Sebastian Thrun. 2002. Probabilistic robotics.Commun. ACM45, 3 (2002), 52–57. Conference’17, July 2017, Washington, DC, USA Jieyu Zhou, Aryan Roy, Sneh Gupta, Daniel Weitekamp, and Christopher J. MacLellan

  78. [78]

    Kashyap Todi, Gilles Bailly, Luis Leiva, and Antti Oulasvirta. 2021. Adapting user interfaces with model-based reinforcement learning. InProceedings of the 2021 CHI conference on human factors in computing systems. 1–13

  79. [79]

    Browser Use. 2024. The AI browser agent. https://browser-use.com. Accessed: July 2025

  80. [80]

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Planbench: An extensible benchmark for evaluat- ing large language models on planning and reasoning about change.Advances in Neural Information Processing Systems36 (2023), 38975–38987

Showing first 80 references.