Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

arxiv: 2509.14528 · v2 · submitted 2025-09-18 · 💻 cs.HC

Pradyumna Shome, Sashreek Krishnan, Sauvik Das

Pith reviewed 2026-05-18 16:52 UTC · model grok-4.3

classification 💻 cs.HC

keywords AI agents, usability, mental models, meta-cognition, human-AI collaboration, orchestration, content creation, insight generation

The pith

Commercial AI agents fall short in practice because their capabilities do not match user mental models and they lack skills for effective collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how the tech industry markets AI agents and whether end users can actually use them for those advertised purposes. A review of 102 commercial agents grouped their promised uses into three broad areas: orchestration of tasks, creation of content, and generation of insights. Usability tests with 31 participants on two popular tools showed that users were often impressed yet ran into repeated problems when trying representative tasks from each area. The core issues were agents behaving in ways that did not fit how people expect tools to work and agents failing to monitor or adjust their own actions during joint work. These gaps point to a need for better alignment between what agents are sold as doing and what humans can realistically get them to do.

Core claim

After mapping marketed capabilities across 102 agents into orchestration, creation, and insight categories, testing on two tools found that users generally liked the agents but encountered significant usability problems. These included agent actions that conflicted with users' mental models of how the tools should operate and a lack of meta-cognitive abilities required for productive human-agent teamwork.

What carries the argument

The gap between three umbrella categories of marketed AI agent uses and the concrete difficulties users encountered when performing representative tasks in each category during usability assessments.

If this is right

Designers need to make agent behavior more predictable so it matches how users already think about task delegation.
Agents require built-in ways to reflect on their progress and communicate uncertainties during ongoing work.
Marketed promises for agents in orchestration, creation, and insight should be tempered until these collaboration gaps close.
Everyday adoption of agents for professional or personal tasks will remain limited until the observed usability barriers are addressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mental-model and meta-cognition shortfalls may appear in other human-AI systems beyond the two tools studied.
Future agent releases could be evaluated against explicit checklists for mental-model alignment before launch.
If these issues persist, organizations may need to invest in user training or hybrid human oversight rather than full agent autonomy.

Load-bearing premise

The chosen tasks and the two tested agents accurately stand in for the full range of marketed capabilities and for typical end-user experiences across commercial AI agents.

What would settle it

A follow-up study with more participants, a wider set of agents, and success rates above 80 percent on similar marketed tasks without the reported mental-model or collaboration problems would undermine the central claim.
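
The 80 percent criterion above can be made concrete with a small calculation. The following is a minimal Python sketch, not anything taken from the paper: it computes a Wilson score interval for a task success rate and checks whether the interval's lower bound clears the threshold. Using the lower bound rather than the raw rate is a stricter reading of the criterion and is an assumption made here; the function names `wilson_interval` and `clears_threshold` and the counts in the usage note are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Return the (lower, upper) Wilson score interval for a success proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def clears_threshold(successes, trials, threshold=0.80):
    """True if even the lower bound of the interval exceeds the threshold.

    Assumption: the 0.80 default mirrors the '80 percent' figure in the text;
    comparing the lower bound instead of the point estimate is our choice.
    """
    lower, _ = wilson_interval(successes, trials)
    return lower > threshold
```

With hypothetical counts, `clears_threshold(28, 31)` is false even though 28 of 31 is about 90 percent, because a sample of 31 leaves a wide interval. This illustrates why the follow-up study would need more participants, not just a higher observed rate.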

Figures

Figures reproduced from arXiv: 2509.14528 by Pradyumna Shome, Sashreek Krishnan, Sauvik Das.

**Figure 1.** We identify five critical usability barriers novice users face when interacting with AI agents to accomplish a task. As…

**Figure 2.** Overview of our research process. We conducted a systematic review to build a taxonomy of marketed use cases of AI…

**Figure 3.** Manus, one of our AI Agents, working on our Holiday Planning task.

**Figure 4.** The initial screen on Operator, featuring a text box…

**Figure 5.** Manus, one of our AI Agents, operating a computer.

**Figure 7.** OpenAI Operator, sharing a sample budget with a…

**Figure 8.** Mean System Usability Scale (SUS) [20] Scores by Task and Agent. The average experience with our chosen agents and tasks was interpreted generally as Good (70-80) and Excellent (80-90), with Slide Making on Operator (69.2, Okay) and on Manus (90.6, Best Imaginable) being notable exceptions.

**Figure 9.** However, many users struggled with recovering from…

**Figure 9.** OpenAI Operator, working on the slide making…

**Figure 10.** Mapping five empirically discovered usability barriers to six implications for next-generation AI agent design.
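
The SUS values quoted in the Figure 8 caption come from the ten-item System Usability Scale questionnaire [20]. As a reading aid, here is a minimal Python sketch of the standard SUS scoring rule: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The `adjective_band` cut-offs are an assumption read off the ranges the caption mentions (Good 70-80, Excellent 80-90, 69.2 as Okay, 90.6 as Best Imaginable); this is not the authors' analysis code, and bands below 70 are not distinguished.

```python
def sus_score(responses):
    """Score one completed SUS questionnaire on a 0-100 scale.

    `responses` holds the ten answers in questionnaire order, each 1-5.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - response
    return total * 2.5


def adjective_band(score):
    """Map a SUS score to an adjective label.

    Assumption: cut-offs are inferred from the Figure 8 caption only.
    """
    if score >= 90:
        return "Best Imaginable"
    if score >= 80:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Okay or lower"
```

Under these assumed cut-offs, `adjective_band(69.2)` falls just outside the Good band and `adjective_band(90.6)` lands in the top band, matching the two exceptions the caption calls out.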

read the original abstract

There is growing imprecision about what "AI agents" are, what they can do, and how effectively they can be used by their intended users. We pose two key research questions: (i) How does the tech industry conceive and market "AI agents"? (ii) What challenges do end-users face when attempting to use commercial AI agents for their advertised uses? We first performed a systematic review of marketed use cases for 102 commercial AI agents, finding that they fall into three umbrella categories: orchestration, creation, and insight. We then evaluated whether end-users could realize these marketed capabilities in practice: we conducted a usability assessment where N = 31 participants attempted representative tasks for each of these categories on two popular commercial AI agent tools: Operator and Manus. We found that users were generally impressed with these agents but faced significant usability challenges ranging from agent capabilities that were misaligned with user mental models to agents lacking the meta-cognitive abilities necessary for effective collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The categorization of 102 agents into three marketed buckets is the useful part; the usability claims from N=31 on two tools do not yet carry the weight the abstract gives them.

The paper's clearest contribution is the systematic review that bins 102 commercial AI agents into orchestration, creation, and insight categories based on what their marketing says they can do. That framing is straightforward and gives readers a practical way to compare claims across tools. The usability assessment then tries to check whether real users can actually do those things on Operator and Manus, and it surfaces concrete issues like mismatched mental models and weak meta-cognition for collaboration. Those observations match what many of us hear in demos and early deployments, so the direction feels honest even if the numbers are limited.

Referee Report

3 major / 1 minor

Summary. The paper conducts a systematic review of 102 commercial AI agents and classifies their marketed use cases into three umbrella categories (orchestration, creation, and insight). It then reports a usability study in which N=31 participants attempted representative tasks from these categories on two commercial tools (Operator and Manus), concluding that users are generally impressed but encounter significant challenges including misalignment between agent capabilities and user mental models as well as insufficient meta-cognitive abilities for effective collaboration.

Significance. If the empirical findings hold after methodological clarification, the work offers a useful empirical contrast between industry marketing claims and observed user difficulties, which could inform agent design and evaluation practices. The broad review of 102 agents provides a reasonable foundation for identifying categories, but the strength of the central claim rests on the usability component.

major comments (3)

[Usability assessment (methods)] The description of the usability assessment provides no information on participant recruitment, task operationalization for each category, data analysis methods, or bias controls. Without these details it is not possible to assess whether the reported challenges are reliably supported by the observations.
[Task selection and category coverage] No explicit mapping is given between the most frequent marketed capabilities identified in the review of 102 agents and the specific representative tasks selected for Operator and Manus. This gap prevents the observed usability issues from being taken as evidence about the advertised capabilities across the three categories.
[Generalization from the usability study] The study evaluates only two tools. The manuscript does not demonstrate that Operator and Manus are representative of the broader set of 102 agents or of commercial AI agents in general, which limits the scope of the claim about 'end-user realities with AI agents'.

minor comments (1)

[Abstract] The abstract states that a 'systematic review' was performed but does not summarize the search strategy, inclusion criteria, or coding process used to arrive at the 102 agents and three categories.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we plan to make to improve the clarity and rigor of our work.

Referee: The description of the usability assessment provides no information on participant recruitment, task operationalization for each category, data analysis methods, or bias controls. Without these details it is not possible to assess whether the reported challenges are reliably supported by the observations.

Authors: We agree that the current description of the usability assessment lacks sufficient methodological detail. In the revised version of the manuscript, we will substantially expand this section to include comprehensive information on participant recruitment (including the recruitment platform, screening criteria, and sample demographics), the operationalization of tasks for each of the three categories with specific examples, the data analysis methods employed (including both quantitative measures and qualitative thematic analysis), and the steps taken to mitigate potential biases such as counterbalancing task order and ensuring inter-coder reliability. These additions will enable readers to better evaluate the robustness of our findings.

revision: yes

Referee: No explicit mapping is given between the most frequent marketed capabilities identified in the review of 102 agents and the specific representative tasks selected for Operator and Manus. This gap prevents the observed usability issues from being taken as evidence about the advertised capabilities across the three categories.

Authors: We appreciate this observation and will address it by adding an explicit mapping in the revised manuscript. We will include a new table or detailed description that connects the most prevalent capabilities identified in our systematic review of 102 agents to the representative tasks used in the usability study for Operator and Manus. This will clarify how the selected tasks reflect the marketed use cases in the orchestration, creation, and insight categories, thereby strengthening the link between the review and the empirical evaluation.

revision: yes

Referee: The study evaluates only two tools. The manuscript does not demonstrate that Operator and Manus are representative of the broader set of 102 agents or of commercial AI agents in general, which limits the scope of the claim about 'end-user realities with AI agents'.

Authors: We acknowledge the limitation inherent in evaluating only two specific tools. In the revised manuscript, we will more explicitly discuss this in the limitations and future work sections, qualifying our claims to reflect that the study provides insights from two prominent commercial agents rather than a comprehensive sample of all 102 reviewed agents. We selected Operator and Manus as they are widely recognized and actively marketed examples that span the identified categories, but we will avoid overgeneralizing and instead frame the findings as indicative of current challenges in the field. We believe this approach maintains the value of the contrast between industry aspirations and user realities while being transparent about scope.

revision: partial
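
The first response above mentions inter-coder reliability for the qualitative analysis without naming a statistic. One common choice for two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal Python illustration under that assumption; the function name `cohens_kappa` and the theme labels in the usage note are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items.

    `coder_a` and `coder_b` are equal-length sequences of category labels.
    """
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, if one coder labels four excerpts `["mental_model", "mental_model", "metacognition", "metacognition"]` and the other labels them `["mental_model", "mental_model", "metacognition", "mental_model"]`, raw agreement is 0.75 but kappa is 0.5, since half of that agreement would be expected by chance.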

Circularity Check

0 steps flagged

No circularity: empirical review and usability study with independent observations

The paper performs a systematic review of 102 commercial AI agents to derive three umbrella categories and then conducts a usability assessment with N=31 participants on two tools. No mathematical derivations, fitted parameters, predictions, or self-citations are used as load-bearing steps. All claims rest on direct participant data and observed behaviors rather than any reduction to inputs by construction, self-definition, or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard HCI usability testing practices and the representativeness of selected tasks and participants. No free parameters or invented entities are introduced; the work applies established empirical methods to a new application area.

axioms (1)

domain assumption: Established usability assessment methods from HCI literature are suitable for evaluating commercial AI agent tools.
The study design relies on conventional participant-based task evaluation without justifying or adapting the approach for AI-specific contexts.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 9 internal anchors

[1] 2025. Beautiful.ai. https://www.beautiful.ai/. Accessed: 2025-09-12
[2] 2025. Cleo AI. https://web.meetcleo.com/. Accessed: 2025-09-11
[3] Anysphere. 2025. Cursor - The AI Code Editor. Anysphere. https://cursor.com/ Accessed: 2025-09-01
[4] 2025. Expedia Trip Matching AI Agent. https://www.marketingdive.com/news/expedia-brings-generative-ai-trip-planning-tool-to-instagram/748233/. Accessed: 2025-09-12
[5] 2025. Gamma Presentation Builder. https://gamma.app/. Accessed: 2025-09-12
[6] 2025. Genspark. https://www.genspark.ai/. Accessed: 2025-09-12
[7] 2025. LiveX AI Agent. https://livex.ai/. Accessed: 2025-09-12
[8] 2025. Lovable. https://lovable.dev/ Accessed: 2025-09-01
[9] 2025. Perplexity AI. https://www.perplexity.ai/. Accessed: 2025-09-11
[10] 2025. Salesforce Agentforce. https://www.salesforce.com/agentforce/. Accessed: 2025-09-12
[11] Eytan Adar. 2018. Bounced Checks at the UI/AI Intersection. In Human-Computer Interaction Consortium Workshop (HCIC’18)
[12] Perplexity AI. 2024. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research Accessed: 2025-09-02
[13] AI Agents Directory. 2025. AI Agent Marketplace & Directory | Find Top AI Agents & AI Agent Solutions. https://aiagentsdirectory.com/ Accessed: 2025-09-01
[14] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Associa... doi:10.1145/3290605.3300233
[15] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. 2019. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7, 1 (Oct. 2019), 2–11. doi:10.1609/hcomp.v7i1.5285
[16] Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S. Weld. 2024. Challenges in Human-Agent Communication. arXiv:2412.10380 [cs.HC] https://arxiv.org/abs/2412.10380
[17] Beam.ai. 2025. Budget Generation | AI Agents & Agentic Workflows. https://beam.ai/workflows/budget-generation Accessed: 2025-09-12
[18] Fabio Bellifemine, Agostino Poggi, and Giovanni Rimassa. 2000. Developing multi-agent systems with JADE. In International workshop on agent theories, architectures, and languages. Springer, 89–103
[19] B. S. Bloom, M. B. Engelhart, E. J. Furst, W. H. Hill, and D. R. Krathwohl. 1956. Taxonomy of educational objectives. The classification of educational goals. Handbook 1: Cognitive domain. Longmans Green, New York
[20] John Brooke et al. 1996. SUS-A quick and dirty usability scale. Usability evaluation in industry 189, 194 (1996), 4–7
[21] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
[22] Vannevar Bush et al. 1945. As We May Think. The Atlantic monthly 176, 1 (1945), 101–108
[23] Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, and Tat-Seng Chua. 2025. Large Language Models Empowered Personalized Web Agents. In Proceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia) (WWW ’25). Association for Computing Machinery, New York, NY, USA, 198–215. doi:10.1145/3696410.3714842
[24] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning
[25] Microsoft Corporation. 1996. Microsoft FrontPage. https://en.wikipedia.org/wiki/Microsoft_FrontPage Web authoring software
[26] Justin Cranshaw, Emad Elwany, Todd Newman, Rafal Kocielnik, Bowen Yu, Sandeep Soni, Jaime Teevan, and Andrés Monroy-Hernández. 2017. Calendar.help: Designing a Workflow-Based Scheduling Agent with Humans in the Loop. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, Denver, CO, USA, 2382–2393. doi:10.1145/3025453.3025780
[27] Allen Cypher, Daniel C. Halbert, David Kurlander, Henry Lieberman, David Maulsby, Brad A. Myers, and Alan Turransky (Eds.). 1993. Watch what I do: programming by demonstration. MIT Press, Cambridge, MA, USA
[28] F. S. de Boer, K. V. Hindriks, W. van der Hoek, and J. J. Ch. Meyer. 2002. Agent Programming with Declarative Goals. arXiv:cs/0207008 [cs.AI] https://arxiv.org/abs/cs/0207008
[29] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [stat.ML] https://arxiv.org/abs/1702.08608
[30] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. 2024. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? arXiv:2403.07718 [cs.LG] https://arxiv.org/abs/2403.07718
[31] Alpana Dubey, Kumar Abhinav, Sakshi Jain, Veenu Arora, and Asha Puttaveerana. 2020. HACO: A Framework for Developing Human-AI Teaming. In 13th Innovations in Software Engineering Conference (ISEC 2020). ACM, Jabalpur, India, 1–9. doi:10.1145/3385032.3385044
[32] Upol Ehsan, Philipp Wintersberger, Q Vera Liao, Elizabeth Anne Watkins, Carina Manger, Hal Daumé III, Andreas Riener, and Mark O Riedl. 2022. Human-Centered Explainable AI (HCXAI): beyond opening the black-box of AI. In CHI conference on human factors in computing systems extended abstracts. 1–7
[33] Dylan Freedman. 2025. The Day ChatGPT Went Cold. The New York Times (2025). https://www.nytimes.com/2025/08/19/business/chatgpt-gpt-5-backlash-openai.html Accessed: 2025-09-04
[34] John D. Gould, John Conti, and Todd Hovanyecz. 1983. Composing letters with a simulated listening typewriter. Commun. ACM 26, 4 (April 1983), 295–308. doi:10.1145/2163.358100
[35] Ryan Hoover. 2013. Product Hunt. https://www.producthunt.com/. Accessed: 2025-09-01
[36] Eric Horvitz. 1999. Principles of mixed-initiative user interfaces (CHI ’99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030
[37] Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, a...
[38] OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use. arXiv:2508.04482 [cs.AI] https://arxiv.org/abs/2508.04482
[39] Edwin L. Hutchins, James D. Hollan, and Donald A. Norman. 1985. Direct Manipulation Interfaces. In User Centered System Design: New Perspectives on Human-Computer Interaction, Donald A. Norman and Stephen W. Draper (Eds.). Lawrence Erlbaum Associates, Hillsdale, NJ, 87–124
[40] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770
[41] Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. 2024. AI Agents That Matter. arXiv:2407.01502 [cs.LG] https://arxiv.org/abs/2407.01502
[42] Garry Kasparov. 2010. The Chess Master and the Computer. The New York Review of Books (2010). https://www.nybooks.com/articles/2010/02/11/the-chess-master-and-the-computer/
[43] Harmanpreet Kaur, Alex C. Williams, and Walter S. Lasecki. 2019. Building Shared Mental Models between Humans and AI for Effective Collaboration. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland). Association for Computing Machinery. doi:10.1145/3290605.3300643
[44] Pranav Khadpe, Ranjay Krishna, Li Fei-Fei, Jeffrey T. Hancock, and Michael S. Bernstein. 2020. Conceptual Metaphors Impact Perceptions of Human-AI Collaboration. Proc. ACM Hum.-Comput. Interact. 4, CSCW2, Article 163 (Oct. 2020), 26 pages. doi:10.1145/3415234
[45] Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. 2024. "I’m Not Sure, But...": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. In Proceedings of the 7th ACM Conference on Fairness, Accountability, and Transparency (FAccT 2024)
[46] Andrew J. Ko, Brad A. Myers, and Htet Htet Aung. 2004. Six Learning Barriers in End-User Programming Systems. In Proceedings of the 2004 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 199–206. doi:10.1109/VLHCC.2004.47
[47] Hui Li, Jiasheng Zhang, and Kun Huang. 2025. Meta-Analyzing the Trust-Performance Link in Collaboration: Moderating Effects of Conceptual and Contextual Factors. Public Performance & Management Review 48, 1 (2025), 1–34. doi:10.1080/15309576.2024.2405839
[48]

Q Vera Liao and Kush R Varshney. 2021. Human-centered explainable ai (xai): From algorithms to user experiences.arXiv preprint arXiv:2110.10790(2021)

work page arXiv 2021
[49]

J. C. R. Licklider. 1960. Man-Computer Symbiosis.IRE Transactions on Human Factors in ElectronicsHFE-1, 1 (1960), 4–11. doi:10.1109/THFE2.1960.4503259

work page doi:10.1109/thfe2.1960.4503259 1960
[50]

Henry Lieberman. 1997. Autonomous Interface Agents. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’97). ACM, Atlanta, Georgia, USA, 67–74. doi:10.1145/258549.258592

work page doi:10.1145/258549.258592 1997
[51]

Henry Lieberman and Ted Selker. 1999. Agents for the User Interface. InProceed- ings of the ACM Conference on Human Factors in Computing Systems (CHI). Media Laboratory, Massachusetts Institute of Technology. https://web.media.mit.edu/ ~lieber/Publications/Agents_for_UI.pdf

work page 1999
[52]

Deloitte Consulting LLP. 2025. AWS Marketplace: Care Finder Agent. https://aws. amazon.com/marketplace/pp/prodview-mau5avpo5xog6. https://aws.amazon. com/marketplace/pp/prodview-mau5avpo5xog6 Product listing for Care Finder Agent, an AI-powered healthcare navigation solution by Deloitte

work page 2025
[53]

Pattie Maes. 1994. Agents that reduce work and information overload.Commun. ACM37, 7 (July 1994), 30–40. doi:10.1145/176789.176792

work page doi:10.1145/176789.176792 1994
[54]

Manus AI. 2025. Manus: General AI agent that bridges mind and action. https: //manus.im/?index=1. Accessed: 2025-09-02

work page 2025
[55]

David Maulsby, Saul Greenberg, and Richard Mander. 1993. Prototyping an intelligent agent through Wizard of Oz. InProceedings of the INTERACT’93 and CHI’93 Conference on Human Factors in Computing Systems. 277–284

work page 1993
[56]

Merritt, Kian Boon Tan, Christopher Ong, Aswin Thomas, Teong Leong Chuah, and Kevin McGee

Tim R. Merritt, Kian Boon Tan, Christopher Ong, Aswin Thomas, Teong Leong Chuah, and Kevin McGee. 2011. Are artificial team-mates scapegoats in com- puter games. InProceedings of the ACM 2011 Conference on Computer Supported Cooperative Work(Hangzhou, China)(CSCW ’11). Association for Computing Machinery, New York, NY, USA, 685–688. doi:10.1145/1958824.1958945

work page doi:10.1145/1958824.1958945 2011
[57]

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences.Artificial Intelligence267 (2019), 1–38. doi:10.1016/j.artint.2018.07.007

work page doi:10.1016/j.artint.2018.07.007 2019
[58]

2023.Spotify Debuts a New AI DJ, Right in Your Pocket

Spotify Newsroom. 2023.Spotify Debuts a New AI DJ, Right in Your Pocket. https://newsroom.spotify.com/2023-02-22/spotify-debuts-a-new-ai-dj- right-in-your-pocket/ Accessed: 2025-09-12

work page 2023
[59]

Christopher Ong, Kevin McGee, and Teong Leong Chuah. 2012. Closing the human-AI team-mate gap: how changes to displayed information impact player behavior towards computer teammates. InProceedings of the 24th Australian Computer-Human Interaction Conference. 433–439

work page 2012
[60]

2025.Introducing ChatGPT agent: bridging research and action

OpenAI. 2025.Introducing ChatGPT agent: bridging research and action. https: //openai.com/index/introducing-chatgpt-agent/ Accessed: 2025-09-02. Why Johnny Can’t Use Agents: Industry Aspirations vs. User Realities with AI Agent Software arXiv ’25, 2025, Pittsburgh, PA, United States

work page 2025
[61]

OpenAI. 2025. Introducing Operator | OpenAI. https://openai.com/index/ introducing-operator/. Accessed: 2025-09-02

work page 2025
[62]

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, and Zhengyang Wu. 2024. WebCanvas: Benchmarking Web Agents in Online Environments. arXiv:2406.12373 [cs.CL] https://arxiv.org/abs/2406.12373

work page arXiv 2024
[63]

Minjung Park, Jodi Forlizzi, and John Zimmerman. 2025. Exploring the Innovation Opportunities for Pre-trained Models. InProceedings of the 2025 ACM Designing Interactive Systems Conference (DIS ’25). 1973–2005. doi:10.1145/3715336.3735753

work page doi:10.1145/3715336.3735753 2025
[64]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al . 2024. Androidworld: A dynamic benchmarking environment for au- tonomous agents.arXiv preprint arXiv:2405.14573(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Matt Renner and Matt A.V. Chaban. 2025.601 real-world gen AI use cases from the world’s leading organizations. https://cloud.google.com/transform/101-real- world-generative-ai-use-cases-from-industry-leaders Published April 12, 2024; last updated April 9, 2025. Google Cloud

work page 2025
[67]

Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, and Sanmi Koyejo. 2025. Measurement to Meaning: A Validity-Centered Framework for AI Evaluation. arXiv:2505.10573 [cs.CY] https://arxiv.org/abs/2505.10573

[68]

Devansh Saxena, Ji-Youn Jung, Jodi Forlizzi, Kenneth Holstein, and John Zimmerman. 2025. AI Mismatches: Identifying Potential Algorithmic Harms Before AI Development. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–23. doi:10.1145/3706598.3714098

[69]

Ben Shneiderman. 1983. Direct manipulation: A step beyond programming languages. Computer 16, 08 (1983), 57–69

[70]


Ben Shneiderman. 2022. Human-centered AI. Oxford University Press

[71]

Ben Shneiderman and Pattie Maes. 1997. Direct manipulation vs. interface agents. Interactions 4, 6 (Nov. 1997), 42–61. doi:10.1145/267505.267514

[72]

Yoav Shoham. 1993. Agent-oriented programming. Artificial Intelligence 60, 1 (1993), 51–92. doi:10.1016/0004-3702(93)90034-9

[73]

Pradyumna Shome and Miuyin Marie Yong Wong. 2024. "Learning Too Much About Me": A User Study on the Security and Privacy of Generative AI Chatbots. In Proceedings of the Twentieth Symposium on Usable Privacy and Security (SOUPS 2024), Poster Session. USENIX Association, Philadelphia, PA, USA. https://www.usenix.org/conference/soups2024/presentation/shome-poster Poster

[75]

Neville A Stanton. 2006. Hierarchical task analysis: Developments, applications, and extensions. Applied Ergonomics 37, 1 (2006), 55–79

[76]

Leon Staufer, Mick Yang, Anka Reuel, and Stephen Casper. 2025. Audit Cards: Contextualizing AI Evaluations. arXiv:2504.13839 [cs.CY] https://arxiv.org/abs/2504.13839

[77]

Blase Ur, Melwyn Pak Yong Ho, Stephen Brawner, Jiyun Lee, Sarah Mennicken, Noah Picard, Diane Schulze, and Michael L Littman. 2016. Trigger-action programming in the wild: An analysis of 200,000 IFTTT recipes. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 3227–3231

[78]

Hanna Wallach, Meera Desai, A Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P Alex Dow, et al. 2025. Position: Evaluating generative AI systems is a social science measurement challenge. arXiv preprint arXiv:2502.00561 (2025)

[79]

Alma Whitten and J Doug Tygar. 1999. Why Johnny Can’t Encrypt: A Usability Evaluation of PGP 5.0. In USENIX Security Symposium, Vol. 348. 169–184

[80]

Wikipedia contributors. 2025. Amazon Alexa. https://en.wikipedia.org/wiki/Amazon_Alexa. Accessed: 2025-09-11


