Auditing and Controlling AI Agent Actions in Spreadsheets
Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3
The pith
An AI agent for spreadsheets called Pista breaks its work into auditable steps so users can inspect and redirect each decision while it runs, rather than only checking the final result.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pista decomposes agent execution into discrete, auditable actions displayed directly in the spreadsheet; users can examine assumptions, spot errors, and redirect the agent at each step. A formative study and a within-subjects evaluation showed that this design improved task outcomes, user comprehension of the work, perceptions of the agent, and feelings of co-ownership compared with a baseline that delivered only the completed result.
What carries the argument
Pista's decomposition of execution into auditable, controllable actions that users can inspect and intervene on in real time before the next step runs.
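The paper does not include Pista's implementation, so as a rough sketch of the interaction pattern described here, the Python below gates each proposed spreadsheet action on user review before it is applied. Every name in it (Action, plan_next, review, run_with_oversight) is hypothetical rather than drawn from the paper.

```python
# Minimal sketch of a step-gated agent loop. Illustrative only: the names
# (Action, plan_next, review, apply) are hypothetical, not Pista's actual API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str               # summary shown to the user before the step runs
    assumption: str                # the assumption the step relies on
    apply: Callable[[dict], None]  # writes the step's result into the sheet


def run_with_oversight(
    plan_next: Callable[[dict, Optional[str]], Optional[Action]],
    review: Callable[[Action], str],
    sheet: dict,
) -> List[Action]:
    """Run the agent one auditable action at a time.

    plan_next proposes the next action from the sheet state plus any user
    feedback; review returns "approve", "skip", or free-text redirection.
    Execution stops when plan_next returns None.
    """
    history: List[Action] = []
    feedback: Optional[str] = None
    while (action := plan_next(sheet, feedback)) is not None:
        decision = review(action)        # pause: user inspects the proposed step
        if decision == "approve":
            action.apply(sheet)          # only approved steps touch the cells
            history.append(action)
            feedback = None
        elif decision == "skip":
            feedback = f"user skipped: {action.description}"
        else:
            feedback = decision          # redirect the agent before the next proposal
    return history
```

In this sketch, the baseline condition of the evaluation corresponds to approving every action automatically and showing the user only the final sheet.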
If this is right
- Users identify their own intent in the agent's intermediate choices and can correct deviations before they affect later cells.
- Error detection improves because problems surface while the reasoning is still visible rather than after the output is finalized.
- Participants report greater comprehension of the overall task and a stronger sense of shared responsibility for the final spreadsheet.
Where Pith is reading between the lines
- The same real-time auditing pattern could be tested in other environments where intermediate states are hard to recover, such as document editing or data pipelines.
- Designers of general-purpose AI agents might need to add explicit pause-and-review points even when the underlying model can run end-to-end without them.
- Over time, repeated use of such interfaces could shift user expectations so that fully autonomous agents feel less trustworthy by default.
Load-bearing premise
That the advantages observed in the small lab studies arise specifically from allowing users to participate in decisions as they occur rather than from other study features or participant differences.
What would settle it
A follow-up study in which participants using only post-hoc review of a completed spreadsheet detect errors and report understanding at rates equal to or higher than those using Pista's step-by-step interface.
Original abstract
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Pista, a spreadsheet AI agent that decomposes multi-step execution into auditable and controllable actions, enabling users to inspect assumptions, detect errors, and intervene during the process rather than only after completion. It reports a formative study (N=8) followed by a within-subjects summative evaluation (N=16) comparing Pista to a baseline agent; results indicate improved task outcomes, user comprehension, error detection, and sense of ownership with Pista. The authors conclude that meaningful oversight of AI agents in knowledge work requires active participation in decisions as they are made, not merely improved post-hoc review mechanisms.
Significance. If the empirical results hold, the work provides concrete evidence that decomposing agent actions into intervention points can enhance user understanding and control in spreadsheet-based knowledge work, where process and artifact are tightly coupled. The mixed-methods design (qualitative insights plus outcome measures) offers a useful template for evaluating in-the-loop oversight. The paper's strength lies in grounding claims in direct user data rather than purely theoretical arguments.
major comments (3)
- [Abstract and Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.
- [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent' without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.
- [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.
minor comments (2)
- [Abstract] Small sample sizes (N=8 formative, N=16 summative) and lack of detailed statistics (effect sizes, confidence intervals) in the summary leave the strength of quantitative claims unclear; add these in the abstract if present in the body.
- [Methods] Clarify counterbalancing procedure, task order effects, and any statistical corrections applied in the within-subjects design to strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which identifies important limitations in how our comparative claims are framed. We respond to each major comment below, indicating revisions that will be incorporated to qualify our statements more precisely while preserving the core contribution of the work.
Point-by-point responses
- Referee: [Abstract and Discussion] The headline claim that meaningful oversight 'requires not improved post-hoc review mechanisms, but active participation' is not supported by the study design. The summative evaluation (N=16) compares Pista only to a baseline agent described as standard post-hoc final-output review; no condition tests enhanced post-hoc mechanisms (e.g., full decision logs, queryable traces, or replay). This makes the necessity claim an extrapolation rather than direct evidence.
Authors: We agree that the phrasing in the abstract and discussion extrapolates from the observed benefits of active participation over a standard baseline. Our within-subjects study demonstrates improvements in task outcomes, comprehension, error detection, and ownership when users can intervene during execution versus reviewing only the final output. However, it does not include a condition with enhanced post-hoc tools. We will revise the abstract to state that the results indicate active participation provides meaningful benefits beyond standard post-hoc review, and will similarly qualify the discussion to avoid implying necessity over all possible improved post-hoc systems. revision: yes
- Referee: [§5 (Summative Evaluation)] The baseline agent is characterized only as 'a baseline agent' without decomposition or explicit intervention points. To substantiate the active-participation advantage over post-hoc review, the paper should either detail the baseline's review capabilities or include a comparison arm with improved post-hoc features.
Authors: We will expand the description of the baseline agent in §5 to explicitly detail its capabilities: users receive only the completed spreadsheet for review and editing, with no access to intermediate reasoning steps, assumptions, or real-time intervention points. This positions the baseline as representative of typical current post-hoc review in AI agent systems. Adding an additional experimental arm with enhanced post-hoc features (such as queryable traces) would require new data collection and is noted as future work. revision: yes
- Referee: [Discussion] The assertion that users 'detected errors that post-hoc review would have failed to surface' assumes the baseline represents the strongest possible post-hoc review, which is not demonstrated. Without that comparison, the error-detection benefit cannot be attributed specifically to active participation.
Authors: The qualitative findings include specific cases in which participants caught and corrected errors (e.g., incorrect assumptions about data ranges or formula logic) during Pista's step-by-step process that were not apparent from inspecting the final output. We will revise the discussion to present these observations as relative to the standard baseline condition rather than a universal claim. We will also add explicit language noting that the study does not compare against enhanced post-hoc mechanisms and that such tools might surface some errors, framing the error-detection benefit as evidence for the value of in-the-loop oversight in the tested setting. revision: yes
- We cannot add a new experimental condition comparing Pista to enhanced post-hoc review mechanisms without conducting additional user studies, which lies outside the scope of a revision.
Circularity Check
No circularity: empirical HCI evaluation derives claims from independent user data
full rationale
The paper introduces Pista via system description and evaluates it through two independent user studies (formative N=8, summative within-subjects N=16) that collect task outcomes, comprehension metrics, error detection, and subjective ownership reports directly from participants. These results are compared against a described baseline agent and do not involve any equations, parameter fitting, self-referential modeling, or load-bearing self-citations that reduce the central claims back to the inputs by construction. The interpretive conclusion that active participation is required (rather than improved post-hoc review) is an extrapolation from the observed differences, not a definitional or fitted equivalence. The evaluation rests on standard user-study methodology rather than on constructs internal to the system itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Small-scale user studies (N=8 and N=16) can yield generalizable insights into human-AI interaction effectiveness.
invented entities (1)
- Pista (no independent evidence)