Engineering Robustness into Personal Agents with the AI Workflow Store

Lillian Tsai; Mariana Raykova; Pierre Tholoniat; Roxana Geambasu; Trishita Tiwari; Wen Zhang; Wen Zhang (Google)

arxiv: 2605.10907 · v3 · pith:XXPEHMODnew · submitted 2026-05-11 · 💻 cs.CR · cs.AI


Pith reviewed 2026-05-13 03:05 UTC · model grok-4.3


keywords: AI agents, software engineering, robustness, workflows, security, reliability, agent systems, reuse


The pith

AI agents must incorporate rigorous software engineering through reusable hardened workflows to achieve production-grade reliability and security.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI agents synthesize plans and execute actions rapidly in response to prompts, bypassing the iterative design, testing, and evaluation that underpin reliable software. This approach may leave users with fragile prototypes unsuitable for important uses. The paper proposes an AI Workflow Store containing pre-hardened, reusable workflows that agents can call upon for better performance. Amortizing the engineering effort across many users would make the added rigor feasible. The work highlights the flexibility-robustness tradeoff and calls for moving past purely on-the-fly methods.

Core claim

By focusing on rapid, real-time synthesis, AI agents are delivering improvised prototypes rather than systems fit for high-stakes scenarios. To address this, the integration of disciplined software engineering processes into the agentic loop is necessary to produce hardened and deterministically-constrained workflows that substantially outperform brittle on-the-fly results, amortized via reuse in an AI Workflow Store.

What carries the argument

The AI Workflow Store, envisioned as a collection of hardened and reusable agent workflows that provide greater reliability and security than on-the-fly tool chains.

If this is right

Hardened workflows would allow agents to invoke pre-vetted plans with deterministic constraints, reducing vulnerability to errors or attacks.
The cost of rigorous processes like adversarial evaluation and staged deployment would be spread across a broad user base.
Agents could transition from prototypes to production-grade systems suitable for high-stakes applications.
Research must tackle challenges in workflow design to balance flexibility and robustness.
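
One way to picture "invoking a pre-vetted plan with deterministic constraints" is the minimal sketch below. It is an illustration only and is not taken from the paper: the names `Workflow`, `WorkflowStore`, `publish`, `invoke`, and the `transfer_money` example with its limit of 100 are all hypothetical. A real store would presumably add review, signing, and auditing on top.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Workflow:
    """Hypothetical hardened workflow: a fixed plan plus an argument check."""
    name: str
    version: str
    check: Callable[[dict], bool]  # deterministic constraint on the arguments
    run: Callable[[dict], str]     # the pre-vetted plan itself


class WorkflowStore:
    """Hypothetical in-memory registry of vetted workflows."""

    def __init__(self) -> None:
        self._workflows: Dict[str, Workflow] = {}

    def publish(self, wf: Workflow) -> None:
        self._workflows[wf.name] = wf

    def invoke(self, name: str, args: dict) -> str:
        wf = self._workflows.get(name)
        if wf is None:
            # The agent may only call workflows that were published.
            raise KeyError(f"no vetted workflow named {name!r}")
        if not wf.check(args):
            # Constraints are enforced in code, not left to the model.
            raise ValueError(f"arguments violate constraints of {name!r}")
        return wf.run(args)


def transfer_ok(args: dict) -> bool:
    """Illustrative constraint: integer amount in (0, 100], string recipient."""
    amount = args.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, int):
        return False
    return 0 < amount <= 100 and isinstance(args.get("to"), str)


def transfer(args: dict) -> str:
    return f"transferred {args['amount']} to {args['to']}"


store = WorkflowStore()
store.publish(Workflow("transfer_money", "1.0.0", transfer_ok, transfer))
```

In this sketch, `store.invoke("transfer_money", {"amount": 50, "to": "alice"})` succeeds, while an amount of 5000 raises `ValueError` before any action runs. The design point is that the check is ordinary code and does not depend on what the agent generated.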

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Community-driven curation of workflows could emerge, similar to package repositories, allowing continuous auditing and improvement.
Users might gain the ability to inspect and select workflows based on their verified properties, increasing transparency in agent behavior.
This model could support domain-specific workflow libraries for areas like finance or healthcare that demand high assurance.

Load-bearing premise

That the extra compute and time required for rigorous software engineering processes can be amortized through reuse across a broad user community without losing the responsiveness users expect.
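
The cost side of this premise can be made concrete with a small back-of-the-envelope model. The sketch below is an assumption of this review, not something the paper provides: it treats hardening as a one-time cost shared across all uses, and every function name and parameter is hypothetical. It does not model responsiveness at all.

```python
import math


def amortized_cost_per_use(hardening_cost: float, invoke_cost: float, uses: int) -> float:
    """Per-use cost when a one-time hardening cost is shared across `uses` invocations."""
    if uses <= 0:
        raise ValueError("uses must be positive")
    return hardening_cost / uses + invoke_cost


def break_even_uses(hardening_cost: float, invoke_cost: float, on_the_fly_cost: float) -> float:
    """Number of uses at which the hardened workflow matches on-the-fly cost per use.

    Returns math.inf when invoking the hardened workflow is not cheaper
    per use than synthesizing on the fly, so the cost is never recovered.
    """
    saving_per_use = on_the_fly_cost - invoke_cost
    if saving_per_use <= 0:
        return math.inf
    return hardening_cost / saving_per_use
```

With placeholder inputs, a hardening cost of 1000 units and a per-invocation cost of 0.5 spread over 1000 uses gives 1.5 units per use. Whether real workflows reach their break-even reuse count is exactly the open question the premise leaves unanswered.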

What would settle it

Observing whether agents using workflows from the proposed store demonstrate measurably lower failure rates or security incidents compared to on-the-fly agents in controlled high-stakes simulations or real deployments.
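
As a rough illustration of what "measurably lower failure rates" could mean, the sketch below computes a standard pooled two-proportion z statistic from failure counts in two arms. The paper does not prescribe any statistical test; the choice of test, the function name, and the arguments are assumptions made here, and any counts passed in would have to come from an actual experiment.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for failure rates of arm A versus arm B.

    A positive value means arm A (e.g. on-the-fly agents) failed more often
    than arm B (e.g. agents invoking store workflows).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one trial")
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; statistic undefined")
    return (p_a - p_b) / se
```

A large positive statistic would support the claim and a value near zero would not, though a real evaluation would also need comparable tasks, realistic adversaries, and a separate accounting of security incidents.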

Figures

Figures reproduced from arXiv: 2605.10907 by Lillian Tsai, Mariana Raykova, Pierre Tholoniat, Roxana Geambasu, Trishita Tiwari, Wen Zhang, Wen Zhang (Google).

**Figure 1.** Problem: The Agentic AI code-and-execute loop short-circuits well-trodden SE processes that are the foundations of the relatively reliable and secure programs and services we enjoy today. These failures arise from the substantial ask of the “on-the-fly” loop: in seconds or minutes, and often for pennies, it must synthesize and execute multi-step plans: sending emails, moving money, booking travel, editing …

**Figure 2.** The AI Workflow Store architecture.

**Figure 3.** Positions our vision within the spectrum defined by the tension between flexibility (ability to respond to any user need with the right functionality) and robustness (reliability and security of that functionality). Traditional software sits at one extreme: highly robust through careful engineering, but expensive to produce and limited in scope and flexibility. Purely on-the-fly agents sit at the other e…

Original abstract

The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the "on-the-fly" paradigm to navigate effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a position paper that clearly flags the risks of on-the-fly AI agent synthesis and proposes pre-engineered workflows as an alternative, but it stays at the level of argument without any supporting data or examples.


The core point is that today's agents improvise plans and actions in real time, which skips the testing, adversarial checks, and staged rollout that make ordinary software more reliable. The authors suggest this leaves users with fragile prototypes even in high-stakes settings. They want to shift toward reusable, hardened workflows stored centrally so agents can invoke them instead of building from scratch each time. That framing of an AI Workflow Store is the freshest part of the piece, even if the underlying ideas about reuse and software engineering discipline have appeared in earlier agent and workflow work.

The paper lays out the flexibility-robustness tension in straightforward terms and shows how current loops short-circuit established SE practices. The logic is consistent and draws on familiar benefits of iterative design and adversarial evaluation without overclaiming. No circular derivations or self-referential citations weaken the case.

The main limitation is the complete absence of concrete illustrations, measurements, or even rough cost estimates. We are told that extra compute for rigor could be amortized through community reuse, but nothing shows whether that amortization would actually happen or how much responsiveness would be lost. Without a worked example or small-scale comparison, it is difficult to gauge whether the proposed shift would deliver the reliability gains claimed.

This paper is aimed at researchers focused on agent security, reliability, and the application of software engineering to AI systems. Readers who care about moving agents beyond ad-hoc prototypes will find the questions useful as a starting point for discussion. It deserves a serious referee because the identified problem is substantive and the direction is worth community input, even though the paper itself is a vision statement rather than a result. I would send it to review.

Referee Report

0 major / 0 minor

Summary. The paper argues that the prevailing on-the-fly paradigm for personal AI agents, characterized by rapid plan synthesis and action execution, short-circuits disciplined software engineering processes including iterative design, rigorous testing, adversarial evaluation, and staged deployment. Consequently, it questions whether such agents are delivering improvised prototypes rather than robust systems for high-stakes scenarios. The authors propose integrating SE rigor to create production-grade agent workflows and envision an AI Workflow Store for reusable, hardened workflows that agents can invoke, outlining associated research challenges arising from the flexibility-robustness tension.

Significance. Should the proposed approach prove viable, it would represent a significant advancement in engineering reliable AI agents by adapting established software engineering methodologies to the agentic setting, potentially mitigating security and robustness issues. The manuscript is credited for grounding its vision in standard SE benefits and for framing the idea as an open research direction requiring further investigation into cost amortization and the flexibility-robustness trade-off, without overclaiming empirical support.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and insightful review, as well as their recommendation to accept the manuscript. Their summary accurately reflects our central argument that the dominant on-the-fly paradigm for personal AI agents circumvents established software engineering practices, and we appreciate the recognition that the work is framed as an open research direction without empirical overclaims.

Circularity Check

0 steps flagged

No significant circularity; position paper with independent argument


The manuscript is a position paper advocating integration of software engineering processes into AI agent workflows via an AI Workflow Store to address flexibility-robustness tensions. It contains no equations, derivations, fitted parameters, or empirical predictions. The central claim is a high-level vision grounded in established SE principles (iterative design, testing, staged deployment) and does not reduce to self-citations, self-definitions, or renamed known results. No load-bearing steps are present that could exhibit circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that traditional software engineering processes reliably produce more robust systems than rapid synthesis, and on the invented concept of a shared Workflow Store whose costs can be amortized.

axioms (1)

Domain assumption: Iterative design, rigorous testing, adversarial evaluation, and staged deployment produce more reliable and secure systems than on-the-fly synthesis.
Invoked in the opening contrast between current agent loops and traditional SE success.

invented entities (1)

AI Workflow Store (no independent evidence)
Purpose: Repository of hardened, reusable, deterministically-constrained agent workflows that can be invoked instead of synthesized on the fly.
Proposed as the central mechanism to amortize engineering costs and deliver robustness.

pith-pipeline@v0.9.0 · 5556 in / 1245 out tokens · 73061 ms · 2026-05-13T03:05:35.197055+00:00 · methodology
