Auditing and Controlling AI Agent Actions in Spreadsheets
Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3
The pith
An AI agent for spreadsheets called Pista breaks its work into auditable steps so users can inspect and redirect each decision while it runs, rather than only checking the final result.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pista decomposes agent execution into discrete, auditable actions displayed directly in the spreadsheet; users can examine assumptions, spot errors, and redirect the agent at each step. A formative study and a within-subjects evaluation showed that this design improved task outcomes, user comprehension of the work, perceptions of the agent, and feelings of co-ownership compared with a baseline that delivered only the completed result.
What carries the argument
Pista's decomposition of execution into auditable, controllable actions that users can inspect and intervene on in real time before the next step runs.
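The paper does not include Pista's implementation, so as a rough sketch of the interaction pattern described here, the Python below gates each proposed spreadsheet action on user review before it is applied. Every name in it (Action, plan_next, review, run_with_oversight) is hypothetical rather than drawn from the paper.

```python
# Minimal sketch of a step-gated agent loop. Illustrative only: the names
# (Action, plan_next, review, apply) are hypothetical, not Pista's actual API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str               # summary shown to the user before the step runs
    assumption: str                # the assumption the step relies on
    apply: Callable[[dict], None]  # writes the step's result into the sheet


def run_with_oversight(
    plan_next: Callable[[dict, Optional[str]], Optional[Action]],
    review: Callable[[Action], str],
    sheet: dict,
) -> List[Action]:
    """Run the agent one auditable action at a time.

    plan_next proposes the next action from the sheet state plus any user
    feedback; review returns "approve", "skip", or free-text redirection.
    Execution stops when plan_next returns None.
    """
    history: List[Action] = []
    feedback: Optional[str] = None
    while (action := plan_next(sheet, feedback)) is not None:
        decision = review(action)        # pause: user inspects the proposed step
        if decision == "approve":
            action.apply(sheet)          # only approved steps touch the cells
            history.append(action)
            feedback = None
        elif decision == "skip":
            feedback = f"user skipped: {action.description}"
        else:
            feedback = decision          # redirect the agent before the next proposal
    return history
```

In this sketch, the baseline condition of the evaluation corresponds to approving every action automatically and showing the user only the final sheet.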
If this is right
- Users identify their own intent in the agent's intermediate choices and can correct deviations before they affect later cells.
- Error detection improves because problems surface while the reasoning is still visible rather than after the output is finalized.
- Participants report greater comprehension of the overall task and a stronger sense of shared responsibility for the final spreadsheet.
Where Pith is reading between the lines
- The same real-time auditing pattern could be tested in other environments where intermediate states are hard to recover, such as document editing or data pipelines.
- Designers of general-purpose AI agents might need to add explicit pause-and-review points even when the underlying model can run end-to-end without them.
- Over time, repeated use of such interfaces could shift user expectations so that fully autonomous agents feel less trustworthy by default.
Load-bearing premise
That the advantages observed in the small lab studies arise specifically from allowing users to participate in decisions as they occur rather than from other study features or participant differences.
What would settle it
A follow-up study in which participants using only post-hoc review of a completed spreadsheet detect errors and report understanding at rates equal to or higher than those using Pista's step-by-step interface.
Original abstract
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Pista, a spreadsheet AI agent that decomposes multi-step execution into auditable and controllable actions, enabling users to inspect assumptions, detect errors, and intervene during the process rather than only after completion. It reports a formative study (N=8) followed by a within-subjects summative evaluation (N=16) comparing Pista to a baseline agent; results indicate improved task outcomes, user comprehension, error detection, and sense of ownership with Pista. The authors conclude that meaningful oversight of AI agents in knowledge work requires active participation in decisions as they are made, not merely improved post-hoc review mechanisms.
Significance. If the empirical results hold, the work provides concrete evidence that decomposing agent actions into intervention points can enhance user understanding and control in spreadsheet-based knowledge work, where process and artifact are tightly coupled. The mixed-methods design (qualitative insights plus outcome measures) offers a useful template for evaluating in-the-loop oversight. The paper's strength lies in grounding claims in direct user data rather than purely theoretical arguments.
major comments (3)
- [Abstract and Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.
- [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent' without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.
- [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.
minor comments (2)
- [Abstract] Small sample sizes (N=8 formative, N=16 summative) and lack of detailed statistics (effect sizes, confidence intervals) in the summary leave the strength of quantitative claims unclear; add these in the abstract if present in the body.
- [Methods] Clarify counterbalancing procedure, task order effects, and any statistical corrections applied in the within-subjects design to strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which identifies important limitations in how our comparative claims are framed. We respond to each major comment below, indicating revisions that will be incorporated to qualify our statements more precisely while preserving the core contribution of the work.
Point-by-point responses
- Referee: [Abstract and Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.
Authors: We agree that the phrasing in the abstract and discussion extrapolates from the observed benefits of active participation over a standard baseline. Our within-subjects study demonstrates improvements in task outcomes, comprehension, error detection, and ownership when users can intervene during execution versus reviewing only the final output. However, it does not include a condition with enhanced post-hoc tools. We will revise the abstract to state that the results indicate active participation provides meaningful benefits beyond standard post-hoc review, and will similarly qualify the discussion to avoid implying necessity over all possible improved post-hoc systems. revision: yes
- Referee: [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent' without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.
Authors: We will expand the description of the baseline agent in §5 to explicitly detail its capabilities: users receive only the completed spreadsheet for review and editing, with no access to intermediate reasoning steps, assumptions, or real-time intervention points. This positions the baseline as representative of typical current post-hoc review in AI agent systems. Adding an additional experimental arm with enhanced post-hoc features (such as queryable traces) would require new data collection and is noted as future work. revision: yes
- Referee: [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.
Authors: The qualitative findings include specific cases in which participants caught and corrected errors (e.g., incorrect assumptions about data ranges or formula logic) during Pista's step-by-step process that were not apparent from inspecting the final output. We will revise the discussion to present these observations as relative to the standard baseline condition rather than a universal claim. We will also add explicit language noting that the study does not compare against enhanced post-hoc mechanisms and that such tools might surface some errors, framing the error-detection benefit as evidence for the value of in-the-loop oversight in the tested setting. revision: yes
- We cannot add a new experimental condition comparing Pista to enhanced post-hoc review mechanisms without conducting additional user studies, which lies outside the scope of a revision.
Circularity Check
No circularity: empirical HCI evaluation derives claims from independent user data
full rationale
The paper introduces Pista via system description and evaluates it through two independent user studies (formative N=8, summative within-subjects N=16) that collect task outcomes, comprehension metrics, error detection, and subjective ownership reports directly from participants. These results are compared against a described baseline agent and do not involve any equations, parameter fitting, self-referential modeling, or load-bearing self-citations that reduce the central claims back to the inputs by construction. The interpretive conclusion that active participation is required (rather than improved post-hoc review) is an extrapolation from the observed differences, not a definitional or fitted equivalence. The evaluation rests on standard user-study methodology rather than on constructs internal to the system itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Small-scale user studies (N=8 and N=16) can yield generalizable insights into human-AI interaction effectiveness.
invented entities (1)
- Pista (no independent evidence)