pith. machine review for the scientific record.

arxiv: 2604.26258 · v2 · submitted 2026-04-29 · 💻 cs.CL · cs.LG

Recognition: unknown

FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG
keywords LLM workflows · bilevel optimization · textual gradients · automated agent induction · workflow optimization · prompt engineering

The pith

LLM workflows can be induced automatically via bilevel optimization and textual gradients to match human-crafted performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper frames the creation of LLM workflows as a bilevel optimization problem to remove the need for hand-crafted pipelines and prompts. An outer loop adjusts the high-level structure of how calls are sequenced, while an inner loop refines each LLM component one by one through modular textual gradients that propagate feedback layer by layer. The resulting FlowBot method produces workflows that perform competitively with baselines built from human designs or other generation techniques, easing a key deployment bottleneck for task-specific agents.
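
To make the loop structure concrete, here is a minimal Python sketch of the bilevel search as described above. It is an editorial rendering, not the paper's implementation; the operators bundled in ops (init_sketch, init_prompts, inner_optimize, validate, propose_sketch) are hypothetical stand-ins for FlowBot's actual components.

    def flowbot(train, valid, ops, n_outer=10, n_inner=5):
        """Bilevel workflow search (sketch). `ops` supplies the hypothetical
        operators named in the lead-in; n_outer and n_inner are illustrative."""
        best, best_score = None, float("-inf")
        W = ops.init_sketch()                      # e.g. a linear chain of LLM calls
        for _ in range(n_outer):                   # outer loop: workflow structure
            thetas = ops.init_prompts(W)
            for _ in range(n_inner):               # inner loop: per-call prompts
                # one pass of layer-wise textual-gradient updates over `train`
                thetas = ops.inner_optimize(W, thetas, train)
            score = ops.validate(W, thetas, valid)
            if score > best_score:                 # keep the best workflow seen so far
                best, best_score = (W, thetas), score
            W = ops.propose_sketch(W, valid)       # LLM-edited structure update
        return best

The outer loop mirrors the "best workflow seen so far" validation curve of Figure 2; the inner loop is expanded in the sketch under "Load-bearing premise" below.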

Core claim

Workflow induction is formulated as a bilevel optimization problem in which the outer loop optimizes the high-level sketch of LLM call structure and the inner loop optimizes each individual call modularly by backpropagating textual gradients layer by layer; LLM workflows discovered this way perform competitively against strong baselines that use human-crafted or generated workflows.
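
Schematically, and as an editorial reconstruction rather than an equation quoted from the paper (notation follows the figure captions below: W for the sketch, θ_k for per-call prompts, a_k for intermediate outputs, g_k for textual gradients):

    \min_{W} \; \min_{\theta_1,\dots,\theta_K} \; \mathcal{L}\bigl(f_W(x;\,\theta_1,\dots,\theta_K),\; y\bigr)

    g_K \leftarrow \mathrm{Critique}(\hat{y}, y), \qquad g_{k-1} \leftarrow \mathrm{ChainRule}_{\mathrm{LLM}}(g_k, a_{k-1}), \qquad \theta_k \leftarrow \mathrm{Update}_{\mathrm{LLM}}(\theta_k, g_k)

Critique, ChainRule_LLM, and Update_LLM are illustrative names for LLM calls that judge the final output, attribute error one layer upstream, and rewrite a prompt against its gradient; both the loss signal and the "gradients" here are natural-language text rather than numbers.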

What carries the argument

Bilevel optimization with an outer loop for workflow structure sketch and an inner loop for modular, layer-by-layer optimization of individual LLM calls using backpropagated textual gradients.

If this is right

  • Workflow design shifts from manual engineering to data-driven optimization for new tasks.
  • Individual LLM components can be refined independently without retraining the entire structure.
  • Agent systems become more scalable because pipeline creation no longer requires per-task human expertise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bilevel structure might extend to workflows that interleave LLM calls with external tools if textual interfaces are provided.
  • Applying the method to longer or more deeply nested workflows could test whether gradient propagation remains stable at scale.
  • Comparing induced workflows on tasks far from the training distribution would indicate how much human-like robustness is retained.

Load-bearing premise

Textual gradients can be reliably backpropagated through LLM calls in a modular, layer-by-layer fashion to optimize components without human intervention or extra supervision.
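
A minimal sketch of what that premise asks for in code, assuming a hypothetical llm(prompt) helper; the prompt strings and function names are illustrative, not the paper's actual templates.

    def llm(prompt: str) -> str:
        """Stand-in for any chat-completion call; swap in a real client."""
        raise NotImplementedError

    def forward(prompts, x):
        """Run the chain: the output of step k feeds step k + 1."""
        acts = [x]
        for theta in prompts:
            acts.append(llm(theta + "\n\nInput:\n" + acts[-1]))
        return acts  # acts[-1] is the final answer

    def backward(prompts, acts, loss_feedback):
        """Propagate textual gradients g_K ... g_1 backwards from the loss."""
        grads = [None] * len(prompts)
        g = loss_feedback
        for k in reversed(range(len(prompts))):
            grads[k] = g
            # textual "chain rule": ask an LLM what the step upstream of k
            # should change so that the downstream critique no longer applies
            g = llm("Downstream critique: " + g
                    + "\nThis step's output: " + acts[k + 1]
                    + "\nWhat should the preceding step change?")
        return grads

    def inner_step(prompts, x, y):
        """One modular inner-loop update: each prompt sees only its own gradient."""
        acts = forward(prompts, x)
        fb = llm("Critique the answer " + acts[-1] + " against the target " + str(y))
        grads = backward(prompts, acts, fb)
        return [llm("Rewrite this prompt to address: " + g + "\n\nPrompt:\n" + t)
                for t, g in zip(prompts, grads)]

If the gradients degrade into generic advice as they pass through backward, upstream prompts drift rather than improve, which is exactly the failure mode the premise rules out.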

What would settle it

A comparison on held-out tasks: consistent and substantial underperformance of FlowBot-induced workflows relative to human-crafted baselines would refute the central claim.

Figures

Figures reproduced from arXiv: 2604.26258 by Hongyeon Yu, Yoon Kim, Young-Bum Kim.

Figure 1. Overview of the FLOWBOT approach. The outer loop optimizes the workflow sketch W (the structure of LLM calls), while the inner loop optimizes each prompt θ_i via layer-wise textual backpropagation. Gradients g_k are propagated backwards from the loss, analogous to backpropagation in neural networks. Example gradient snippets (in italics) are from a HotpotQA task where a_{K−1} produced a hallucination that propa…
Figure 2. Validation performance of the best workflow seen so far across a single training epoch.
Figure 3. Textual backpropagation example on HotpotQA during training. Forward (top): Step 4 correctly extracts “1846” (Bucknell’s founding year), but Step 5 hallucinates by incorrectly associating G.W. Leach with Kentucky (founded 1865) instead of Bucknell, rejecting the correct answer. Backward (middle): Textual gradients g_6 → g_5 → g_4 propagate error attribution via the chain rule. Prompt Update (bottom): Gradient g_5 …
Original abstract

LLM workflows, which coordinate structured calls to individual LLMs/agents to achieve a particular goal, offer a promising path towards building powerful AI systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real-world deployment. How can we automatically induce LLM-based agents and workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing agents and LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one by one. Both loops are optimized with “textual gradients”, where for the inner loop we optimize each component in a modular way through “backpropagating” textual gradients layer by layer. We find that LLM workflows discovered through our FlowBot (workflow induction through bilevel optimization and textual gradients) approach perform competitively against strong baselines that make use of human-crafted or generated workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a data-driven approach called FlowBot for inducing LLM workflows via bilevel optimization. An outer loop optimizes the workflow structure sketch, while an inner loop optimizes each LLM call sequentially using textual gradients backpropagated layer-by-layer. It asserts that the resulting workflows perform competitively against strong baselines that use human-crafted or generated workflows.

Significance. Should the approach prove effective, it would offer a valuable method for automating the creation of complex LLM-based systems, addressing the current reliance on manual design. The bilevel formulation with textual gradients represents a creative application of optimization concepts to LLM workflows, potentially enabling more scalable and adaptive AI agents without extensive human supervision.

major comments (2)
  1. Abstract: The abstract states that LLM workflows discovered through FlowBot perform competitively but provides no quantitative results, metrics, experimental setup, or ablation details. This makes it impossible to assess whether the data supports the central claim.
  2. Inner-loop description: The inner loop optimizes each LLM call one by one in a modular, sequential manner via layer-by-layer textual gradient backpropagation. Given that the outputs of one component feed into the next, this approach risks failing to capture inter-component dependencies; the manuscript should demonstrate that corrective signals propagate effectively or provide comparisons to joint optimization methods.
minor comments (1)
  1. Abstract: Consider adding a brief mention of key experimental outcomes or metrics to strengthen the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: Abstract: The abstract states that LLM workflows discovered through FlowBot perform competitively but provides no quantitative results, metrics, experimental setup, or ablation details. This makes it impossible to assess whether the data supports the central claim.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately evaluate the central claims. The full manuscript reports quantitative results, including performance metrics and comparisons against human-crafted and generated baselines, in Sections 4 and 5. In the revised version we will update the abstract to include key quantitative findings (e.g., competitive accuracy or success-rate numbers) together with a concise statement of the experimental tasks and evaluation protocol.

    revision: yes

  2. Referee: Inner-loop description: The inner loop optimizes each LLM call one by one in a modular, sequential manner via layer-by-layer textual gradient backpropagation. Given that the outputs of one component feed into the next, this approach risks failing to capture inter-component dependencies; the manuscript should demonstrate that corrective signals propagate effectively or provide comparisons to joint optimization methods.

    Authors: We appreciate the referee’s concern about inter-component dependencies. In the current formulation the textual gradient computed at a downstream component is back-propagated through the chain of LLM calls, so that the optimization of an upstream component receives a corrective signal that already reflects the downstream effect of its output. This sequential back-propagation is intended to capture the relevant dependencies without requiring a single joint optimization over the entire workflow. Nevertheless, we acknowledge that an explicit demonstration of signal propagation and a comparison against a joint-optimization baseline would strengthen the paper. We will add both a clarifying paragraph on the back-propagation mechanism and an ablation study contrasting the modular sequential approach with a joint inner-loop baseline in the revised manuscript.

    revision: yes
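
For concreteness, the promised ablation could contrast the two inner loops roughly as follows. This is an editorial sketch reusing forward and inner_step from the sketch under "Load-bearing premise" above; joint_update is a hypothetical single LLM call that rewrites every prompt at once from the end-to-end transcript, with no per-layer credit assignment.

    def modular_inner(prompts, batch):
        """FlowBot-style: layer-wise textual backprop, one gradient per component."""
        for x, y in batch:
            prompts = inner_step(prompts, x, y)
        return prompts

    def joint_inner(prompts, batch):
        """Joint baseline: end-to-end rewrite without layer-wise gradients."""
        for x, y in batch:
            acts = forward(prompts, x)
            prompts = joint_update(prompts, acts, y)  # hypothetical helper
        return prompts

Holding the number of LLM calls comparable between the two conditions would isolate the value of per-layer credit assignment.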

Circularity Check

0 steps flagged

No circularity: empirical method with independent optimization loops

Full rationale

The paper formulates workflow induction as bilevel optimization (outer sketch + inner per-LLM-call textual gradient updates) and reports competitive empirical performance against human-crafted baselines. No equations or derivations are presented that reduce a claimed result to its own fitted inputs by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled in. The central claim rests on experimental comparison rather than tautological re-expression of the method itself, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the approach implicitly relies on standard bilevel-optimization concepts and on the effectiveness of textual feedback in LLMs.

pith-pipeline@v0.9.0 · 5515 in / 1073 out tokens · 63503 ms · 2026-05-07T13:16:49.163072+00:00 · methodology



    for targeted attribute retrieval. Step Description Type Change Before:W= [a 1, a2, a3, a4](1 tool) 1 Wikipedia search Tool – 2 Answer extraction LLM – 3 Verification LLM – 4 Final answer LLM – After:W ′ = [a ′ 1, a′ 2, a′ 3, a′ 4, a′ 5, a′ 6](2 tools) 1 Wikipedia search Tool reuse 2 Entity disambiguation LLM new 3 Targeted attribute retrieval Tool new 4 A...