pith. machine review for the scientific record.

arxiv: 2604.20070 · v1 · submitted 2026-04-22 · 💻 cs.HC · cs.AI · cs.CE

Recognition: unknown

Auditing and Controlling AI Agent Actions in Spreadsheets

Esmeralda Eufracio, Run Huang, Sadra Sabouri, Souti Chattopadhyay, Sujay Maladi, Sumit Gulwani, Zeinabsadat Saghi

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CE
keywords AI agents · spreadsheets · human-AI collaboration · auditing · oversight · knowledge work · user control · interactive execution

The pith

An AI agent for spreadsheets called Pista breaks its work into auditable steps so users can inspect and redirect each decision while it runs, rather than only checking the final result.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Pista, an AI agent designed for spreadsheet tasks that records every action in visible cells and pauses for user input before proceeding. In controlled studies, participants who could intervene at each step caught mistakes that would have been hidden in a complete output, understood the underlying logic better, and felt they shared ownership of the result. The authors argue that this real-time control changes how people relate to the agent and the task, moving oversight from passive review to active co-execution. The work focuses on spreadsheets because each cell change directly alters a user-owned artifact, making hidden decisions especially costly. If the approach holds, it suggests that future AI tools in knowledge work should expose intermediate actions by default instead of optimizing solely for speed and final accuracy.

Core claim

Pista decomposes agent execution into discrete, auditable actions displayed directly in the spreadsheet; users can examine assumptions, spot errors, and redirect the agent at each step. A formative study and a within-subjects evaluation showed that this design improved task outcomes, user comprehension of the work, perceptions of the agent, and feelings of co-ownership compared with a baseline that delivered only the completed result.

What carries the argument

Pista's decomposition of execution into auditable, controllable actions that users can inspect and intervene on in real time before the next step runs.
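That mechanism can be sketched in a few lines. This is an illustrative loop only, not Pista's actual implementation; all names here (`Step`, `run_with_oversight`, `ask_user`) are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One auditable unit of agent work (fields are illustrative)."""
    description: str   # in-situ explanation shown to the user
    formula: str       # e.g. "=SUM(B2:B13)"
    target_range: str  # cells the step will modify

def run_with_oversight(steps, apply_step, ask_user):
    """Execute steps one at a time, pausing for user input before each.

    `apply_step` writes a step into the sheet; `ask_user` returns
    ('approve' | 'edit' | 'stop', optional revised step).
    """
    applied = []
    for step in steps:
        decision, revised = ask_user(step)
        if decision == "stop":   # user redirects before later cells are touched
            break
        if decision == "edit":   # user corrects the assumption in place
            step = revised
        apply_step(step)         # the action lands in visible cells
        applied.append(step)     # every action stays on the audit trail
    return applied
```

The key design choice the paper credits is that `ask_user` runs *before* each `apply_step`, so a wrong assumption never propagates into later cells.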

If this is right

  • Users identify their own intent in the agent's intermediate choices and can correct deviations before they affect later cells.
  • Error detection improves because problems surface while the reasoning is still visible rather than after the output is finalized.
  • Participants report greater comprehension of the overall task and a stronger sense of shared responsibility for the final spreadsheet.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same real-time auditing pattern could be tested in other environments where intermediate states are hard to recover, such as document editing or data pipelines.
  • Designers of general-purpose AI agents might need to add explicit pause-and-review points even when the underlying model can run end-to-end without them.
  • Over time, repeated use of such interfaces could shift user expectations so that fully autonomous agents feel less trustworthy by default.

Load-bearing premise

That the advantages observed in the small lab studies arise specifically from allowing users to participate in decisions as they occur rather than from other study features or participant differences.

What would settle it

A follow-up study in which participants using only post-hoc review of a completed spreadsheet detect errors and report understanding at rates equal to or higher than those using Pista's step-by-step interface.

Figures

Figures reproduced from arXiv: 2604.20070 by Esmeralda Eufracio, Run Huang, Sadra Sabouri, Souti Chattopadhyay, Sujay Maladi, Sumit Gulwani, Zeinabsadat Saghi.

Figure 1
Figure 1: Pista is an AI agent for spreadsheets that decomposes its execution into traceable, steerable steps, illustrated here on a tax return preparation task. At each step, Pista provides an In-Situ Explanation of the action and surfaces the underlying formula and data range so users can verify correctness without inspecting individual cells. Users can probe the agent's reasoning through follow-up questions and …

Figure 2
Figure 2: Pista decomposes AI agent execution into traceable, steerable steps. It begins by generating modifiable requirements (1), which users can refine by adding (1a) or removing (1b) items. Pista then situates each step (2) within the plan (2a), highlights the affected spreadsheet range (2b), and shows progress via a node-based view (2c; current step highlighted). At each step, it displays the applied formula (3a), highlig…

Figure 3
Figure 3: Likert ratings box plots for Pista vs. Baseline (N = 15). Significance via two-sided Wilcoxon signed-rank test: *p < .05, **p < .01, ***p < .001.

Figure 4
Figure 4: Per-session distributions across four measures: (A) …

Figure 5
Figure 5: Number of qualitative codes in participants' ex…
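The significance test named in Figure 3's caption, a two-sided Wilcoxon signed-rank test on paired ratings, can be run with `scipy.stats.wilcoxon`. The Likert ratings below are fabricated stand-ins to show the mechanics, not the study's data.

```python
from scipy.stats import wilcoxon

# Hypothetical paired Likert ratings (1-7) from the same 15 participants;
# invented for illustration, not taken from the paper.
pista    = [6, 7, 6, 5, 7, 6, 6, 7, 5, 6, 7, 6, 5, 6, 7]
baseline = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 4, 5]

# Two-sided test on the paired differences, as in the figure caption.
stat, p = wilcoxon(pista, baseline, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p:.4f}")
```

Because every fabricated difference here is positive, the smaller rank sum is zero and the test comes out highly significant; real Likert data with ties will trigger scipy's normal approximation rather than the exact distribution.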
read the original abstract

Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Pista, a spreadsheet AI agent that decomposes multi-step execution into auditable and controllable actions, enabling users to inspect assumptions, detect errors, and intervene during the process rather than only after completion. It reports a formative study (N=8) followed by a within-subjects summative evaluation (N=16) comparing Pista to a baseline agent; results indicate improved task outcomes, user comprehension, error detection, and sense of ownership with Pista. The authors conclude that meaningful oversight of AI agents in knowledge work requires active participation in decisions as they are made, not merely improved post-hoc review mechanisms.

Significance. If the empirical results hold, the work provides concrete evidence that decomposing agent actions into intervention points can enhance user understanding and control in spreadsheet-based knowledge work, where process and artifact are tightly coupled. The mixed-methods design (qualitative insights plus outcome measures) offers a useful template for evaluating in-the-loop oversight. The paper's strength lies in grounding claims in direct user data rather than purely theoretical arguments.

major comments (3)
  1. [Abstract, Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.
  2. [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent', without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.
  3. [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.
minor comments (2)
  1. [Abstract] Small sample sizes (N=8 formative, N=16 summative) and the lack of detailed statistics (effect sizes, confidence intervals) in the summary leave the strength of quantitative claims unclear; add these to the abstract if present in the body.
  2. [Methods] Clarify the counterbalancing procedure, task-order effects, and any statistical corrections applied in the within-subjects design to strengthen reproducibility.
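The counterbalancing this comment asks to see documented is often a simple AB/BA order rotation across participants. A minimal sketch of that generic scheme (illustrative; not the paper's reported procedure):

```python
from itertools import cycle

def counterbalance(participants, conditions=("Pista", "Baseline")):
    """Alternate condition order across participants so each order
    appears equally often (simple AB/BA counterbalancing)."""
    orders = cycle([conditions, conditions[::-1]])
    return {p: next(orders) for p in participants}

# 16 hypothetical participant IDs, matching the summative study's N.
assignment = counterbalance([f"P{i:02d}" for i in range(1, 17)])
# P01 sees Pista first, P02 sees Baseline first, and so on.
```

With an even N, each order occurs for exactly half the participants, which is what a methods section would need to state to rule out task-order effects.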

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review, which identifies important limitations in how our comparative claims are framed. We respond to each major comment below, indicating revisions that will be incorporated to qualify our statements more precisely while preserving the core contribution of the work.

read point-by-point responses
  1. Referee: [Abstract, Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.

    Authors: We agree that the phrasing in the abstract and discussion extrapolates from the observed benefits of active participation over a standard baseline. Our within-subjects study demonstrates improvements in task outcomes, comprehension, error detection, and ownership when users can intervene during execution versus reviewing only the final output. However, it does not include a condition with enhanced post-hoc tools. We will revise the abstract to state that the results indicate active participation provides meaningful benefits beyond standard post-hoc review, and will similarly qualify the discussion to avoid implying necessity over all possible improved post-hoc systems. revision: yes

  2. Referee: [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent', without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.

    Authors: We will expand the description of the baseline agent in §5 to explicitly detail its capabilities: users receive only the completed spreadsheet for review and editing, with no access to intermediate reasoning steps, assumptions, or real-time intervention points. This positions the baseline as representative of typical current post-hoc review in AI agent systems. Adding an additional experimental arm with enhanced post-hoc features (such as queryable traces) would require new data collection and is noted as future work. revision: yes

  3. Referee: [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.

    Authors: The qualitative findings include specific cases in which participants caught and corrected errors (e.g., incorrect assumptions about data ranges or formula logic) during Pista's step-by-step process that were not apparent from inspecting the final output. We will revise the discussion to present these observations as relative to the standard baseline condition rather than a universal claim. We will also add explicit language noting that the study does not compare against enhanced post-hoc mechanisms and that such tools might surface some errors, framing the error-detection benefit as evidence for the value of in-the-loop oversight in the tested setting. revision: yes

standing simulated objections (1 unresolved)
  • We cannot add a new experimental condition comparing Pista to enhanced post-hoc review mechanisms without conducting additional user studies, which lies outside the scope of a revision.

Circularity Check

0 steps flagged

No circularity: empirical HCI evaluation derives claims from independent user data

full rationale

The paper introduces Pista via system description and evaluates it through two independent user studies (formative N=8, summative within-subjects N=16) that collect task outcomes, comprehension metrics, error detection, and subjective ownership reports directly from participants. These results are compared against a described baseline agent and do not involve any equations, parameter fitting, self-referential modeling, or load-bearing self-citations that reduce the central claims back to the inputs by construction. The interpretive conclusion that active participation is required (rather than improved post-hoc review) is an extrapolation from the observed differences, not a definitional or fitted equivalence. The work is self-contained against external benchmarks of user-study methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that small-scale user studies validly demonstrate the superiority of active participation and that the Pista design is the causal factor.

axioms (1)
  • domain assumption Small-scale user studies (N=8 and N=16) can yield generalizable insights into human-AI interaction effectiveness
    Conclusions about oversight, comprehension, and ownership are drawn directly from these participant numbers and study designs.
invented entities (1)
  • Pista no independent evidence
    purpose: Spreadsheet AI agent that decomposes execution into auditable, controllable actions
    Pista is introduced and evaluated as a new artifact in the paper.

pith-pipeline@v0.9.0 · 5607 in / 1348 out tokens · 57099 ms · 2026-05-10T00:15:13.752623+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 29 canonical work pages · 1 internal anchor
