Recognition: no theorem link
Engineering Robustness into Personal Agents with the AI Workflow Store
Pith reviewed 2026-05-13 03:05 UTC · model grok-4.3
The pith
AI agents must incorporate rigorous software engineering through reusable hardened workflows to achieve production-grade reliability and security.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By focusing on rapid, real-time synthesis, AI agents deliver improvised prototypes rather than systems fit for high-stakes scenarios. Addressing this requires integrating disciplined software engineering processes into the agentic loop to produce hardened, deterministically constrained workflows that substantially outperform brittle on-the-fly results, with the cost of rigor amortized through reuse in an AI Workflow Store.
What carries the argument
The AI Workflow Store, envisioned as a collection of hardened and reusable agent workflows that provide greater reliability and security than on-the-fly tool chains.
If this is right
- Hardened workflows would allow agents to invoke pre-vetted plans with deterministic constraints, reducing vulnerability to errors or attacks.
- The cost of rigorous processes like adversarial evaluation and staged deployment would be spread across a broad user base.
- Agents could transition from prototypes to production-grade systems suitable for high-stakes applications.
- Research must tackle challenges in workflow design to balance flexibility and robustness.
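The hardened-workflow idea in the bullets above can be sketched as a thin enforcement layer around plan execution. This is a minimal illustration only: the schema fields, the `run` function, and the constraint checks are hypothetical, not the paper's design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardenedWorkflow:
    """A pre-vetted plan template with deterministic constraints (hypothetical schema)."""
    name: str
    allowed_tools: frozenset  # tools this workflow has been vetted to call
    max_steps: int            # hard ceiling on actions per run
    audit_hash: str           # identifies the reviewed, immutable version

def run(workflow: HardenedWorkflow, plan: list) -> list:
    """Execute a plan only if it respects the workflow's constraints.

    Rejecting out-of-policy steps before execution is what separates a
    hardened workflow from an improvised, on-the-fly tool chain.
    """
    if len(plan) > workflow.max_steps:
        raise PermissionError(f"plan exceeds {workflow.max_steps} steps")
    for step in plan:
        if step["tool"] not in workflow.allowed_tools:
            raise PermissionError(f"tool {step['tool']!r} not vetted for {workflow.name}")
    # A real system would now execute each step in a sandbox;
    # here we simply echo the approved steps.
    return [f"{s['tool']}({s['arg']})" for s in plan]

wf = HardenedWorkflow("expense_report", frozenset({"read_csv", "sum"}), 10, "sha256:...")
print(run(wf, [{"tool": "read_csv", "arg": "q1.csv"}, {"tool": "sum", "arg": "total"}]))
```

Note the design choice: the constraint check is pure data validation, so the same vetted manifest can gate any agent's plan regardless of which model produced it.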
Where Pith is reading between the lines
- Community-driven curation of workflows could emerge, similar to package repositories, allowing continuous auditing and improvement.
- Users might gain the ability to inspect and select workflows based on their verified properties, increasing transparency in agent behavior.
- This model could support domain-specific workflow libraries for areas like finance or healthcare that demand high assurance.
Load-bearing premise
That the extra compute and time required for rigorous software engineering processes can be amortized through reuse across a broad user community without losing the responsiveness users expect.
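The amortization premise can be made concrete with back-of-envelope arithmetic: a fixed hardening cost spread over reuse, plus a marginal cost per run. The dollar figures below are illustrative assumptions, not numbers from the paper.

```python
def cost_per_invocation(hardening_cost: float, run_cost: float, invocations: int) -> float:
    """Fixed engineering cost amortized over reuse, plus marginal run cost."""
    return hardening_cost / invocations + run_cost

# Illustrative: a $50,000 hardening effort vs. $0.05 per run.
print(cost_per_invocation(50_000, 0.05, 1))          # one-off use: rigor is prohibitive
print(cost_per_invocation(50_000, 0.05, 1_000_000))  # broad reuse: approaches marginal cost
```

The premise holds only if `invocations` is large, which is why the store model (many users sharing one vetted workflow) is load-bearing.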
What would settle it
Observing whether agents using workflows from the proposed store demonstrate measurably lower failure rates or security incidents compared to on-the-fly agents in controlled high-stakes simulations or real deployments.
read the original abstract
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the "on-the-fly" paradigm to navigate effectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that the prevailing on-the-fly paradigm for personal AI agents, characterized by rapid plan synthesis and action execution, short-circuits disciplined software engineering processes including iterative design, rigorous testing, adversarial evaluation, and staged deployment. Consequently, it questions whether such agents are delivering improvised prototypes rather than robust systems for high-stakes scenarios. The authors propose integrating SE rigor to create production-grade agent workflows and envision an AI Workflow Store for reusable, hardened workflows that agents can invoke, outlining associated research challenges arising from the flexibility-robustness tension.
Significance. Should the proposed approach prove viable, it would represent a significant advance in engineering reliable AI agents by adapting established software engineering methodologies to the agentic setting, potentially mitigating security and robustness issues. The manuscript is commendable for grounding its vision in standard SE benefits and for framing the idea as an open research direction, one requiring further investigation into cost amortization and the flexibility-robustness trade-off, without overclaiming empirical support.
Simulated Author's Rebuttal
We thank the referee for their positive and insightful review, as well as their recommendation to accept the manuscript. Their summary accurately reflects our central argument that the dominant on-the-fly paradigm for personal AI agents circumvents established software engineering practices, and we appreciate the recognition that the work is framed as an open research direction without empirical overclaims.
Circularity Check
No significant circularity; position paper with independent argument
full rationale
The manuscript is a position paper advocating integration of software engineering processes into AI agent workflows via an AI Workflow Store to address flexibility-robustness tensions. It contains no equations, derivations, fitted parameters, or empirical predictions. The central claim is a high-level vision grounded in established SE principles (iterative design, testing, staged deployment) and does not reduce to self-citations, self-definitions, or renamed known results. No load-bearing steps are present that could exhibit circularity by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: iterative design, rigorous testing, adversarial evaluation, and staged deployment produce more reliable and secure systems than on-the-fly synthesis.
invented entities (1)
- AI Workflow Store (no independent evidence)