FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients
Pith reviewed 2026-05-07 13:16 UTC · model grok-4.3
The pith
LLM workflows can be induced automatically via bilevel optimization and textual gradients to match human-crafted performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Workflow induction is formulated as a bilevel optimization problem in which the outer loop optimizes the high-level sketch of LLM call structure and the inner loop optimizes each individual call modularly by backpropagating textual gradients layer by layer; LLM workflows discovered this way perform competitively against strong baselines that use human-crafted or generated workflows.
What carries the argument
Bilevel optimization with an outer loop for workflow structure sketch and an inner loop for modular, layer-by-layer optimization of individual LLM calls using backpropagated textual gradients.
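The two nested loops can be sketched as a toy program. This is a minimal illustration of the bilevel structure under stated assumptions, not FlowBot's actual API: the "LLM" is a deterministic stub, and names like `Component`, `inner_loop`, and `outer_loop` are illustrative stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    prompt: str

def stub_llm(instruction: str) -> str:
    # Deterministic stand-in for a real model call: just tags its input.
    return f"[llm] {instruction}"

def inner_loop(components, feedback):
    """Walk components output-to-input, updating each prompt with the
    current textual gradient and propagating a gradient upstream."""
    grad = feedback
    for comp in reversed(components):
        comp.prompt = stub_llm(f"revise '{comp.prompt}' given feedback: {grad}")
        grad = stub_llm(f"describe what {comp.name}'s output should change, given: {grad}")
    return components

def outer_loop(components, rounds=2):
    for _ in range(rounds):
        # In the paper's formulation the outer step also edits the sketch
        # itself (adding, removing, or restructuring calls); this toy
        # keeps the structure fixed and only runs the inner loop.
        feedback = "final answers are verbose and sometimes unsupported"
        components = inner_loop(components, feedback)
    return components

flow = [Component("retrieve", "find relevant passages"),
        Component("answer", "answer using the passages")]
flow = outer_loop(flow)
```

The point of the sketch is the control flow: the outer loop owns the list of components (the "sketch"), while the inner loop touches one component at a time, in reverse order, mirroring backpropagation.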
If this is right
- Workflow design shifts from manual engineering to data-driven optimization for new tasks.
- Individual LLM components can be refined independently without retraining the entire structure.
- Agent systems become more scalable because pipeline creation no longer requires per-task human expertise.
Where Pith is reading between the lines
- The same bilevel structure might extend to workflows that interleave LLM calls with external tools if textual interfaces are provided.
- Applying the method to longer or more deeply nested workflows could test whether gradient propagation remains stable at scale.
- Comparing induced workflows on tasks far from the training distribution would indicate how much human-like robustness is retained.
Load-bearing premise
Textual gradients can be reliably backpropagated through LLM calls in a modular, layer-by-layer fashion to optimize components without human intervention or extra supervision.
What would settle it
A head-to-head comparison on held-out tasks: consistent and substantial underperformance by FlowBot-induced workflows relative to human-crafted baselines would refute the claim, while parity or better would support it.
Original abstract
LLM workflows, which coordinate structured calls to individual LLMs/agents to achieve a particular goal, offer a promising path towards building powerful AI systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real-world deployment. How can we automatically induce LLM-based agents and workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing agents and LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one by one. Both loops are optimized with "textual gradients", where for the inner loop we optimize each component in a modular way through "backpropagating" textual gradients layer by layer. We find that LLM workflows discovered through our FlowBot (workflow induction through bilevel optimization and textual gradients) approach perform competitively against strong baselines that make use of human-crafted or generated workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a data-driven approach called FlowBot for inducing LLM workflows via bilevel optimization. An outer loop optimizes the workflow structure sketch, while an inner loop optimizes each LLM call sequentially using textual gradients backpropagated layer-by-layer. It asserts that the resulting workflows perform competitively against strong baselines that use human-crafted or generated workflows.
Significance. Should the approach prove effective, it would offer a valuable method for automating the creation of complex LLM-based systems, addressing the current reliance on manual design. The bilevel formulation with textual gradients represents a creative application of optimization concepts to LLM workflows, potentially enabling more scalable and adaptive AI agents without extensive human supervision.
major comments (2)
- Abstract: The abstract states that LLM workflows discovered through FlowBot perform competitively but provides no quantitative results, metrics, experimental setup, or ablation details. This makes it impossible to assess whether the data supports the central claim.
- Inner-loop description: The inner loop optimizes each LLM call one-by-one in a modular, sequential manner via layer-by-layer textual gradient backpropagation. Given that outputs of one component feed into the next, this approach risks not capturing inter-component dependencies; the manuscript should demonstrate effective propagation of corrective signals or provide comparisons to joint optimization methods.
minor comments (1)
- Abstract: Consider adding a brief mention of key experimental outcomes or metrics to strengthen the abstract.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: Abstract: The abstract states that LLM workflows discovered through FlowBot perform competitively but provides no quantitative results, metrics, experimental setup, or ablation details. This makes it impossible to assess whether the data supports the central claim.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately evaluate the central claims. The full manuscript reports quantitative results, including performance metrics and comparisons against human-crafted and generated baselines, in Sections 4 and 5. In the revised version we will update the abstract to include key quantitative findings (e.g., competitive accuracy or success-rate numbers) together with a concise statement of the experimental tasks and evaluation protocol. Revision: yes.
Referee: Inner-loop description: The inner loop optimizes each LLM call one-by-one in a modular, sequential manner via layer-by-layer textual gradient backpropagation. Given that outputs of one component feed into the next, this approach risks not capturing inter-component dependencies; the manuscript should demonstrate effective propagation of corrective signals or provide comparisons to joint optimization methods.
Authors: We appreciate the referee's concern about inter-component dependencies. In the current formulation the textual gradient computed at a downstream component is back-propagated through the chain of LLM calls, so that the optimization of an upstream component receives a corrective signal that already reflects the downstream effect of its output. This sequential back-propagation is intended to capture the relevant dependencies without requiring a single joint optimization over the entire workflow. Nevertheless, we acknowledge that an explicit demonstration of signal propagation and a comparison against a joint-optimization baseline would strengthen the paper. We will add both a clarifying paragraph on the back-propagation mechanism and an ablation study contrasting the modular sequential approach with a joint inner-loop baseline in the revised manuscript. Revision: yes.
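A single back-propagation step of the kind the authors describe can be sketched as follows. The prompt wording and helper names are illustrative assumptions, not the paper's actual templates (the paper's prompts similarly ask both what to change and what to preserve); the "LLM" is again a deterministic stub.

```python
def fake_llm(prompt: str) -> str:
    # Deterministic stand-in for a model call.
    return f"[gradient for: {prompt.splitlines()[0]}]"

def backprop_step(component_name, component_output, downstream_grad):
    """Turn the textual gradient of a downstream component into a corrective
    signal for this one: what about ITS output caused the downstream issue?"""
    prompt = (
        f"Component '{component_name}' produced: {component_output}\n"
        f"Downstream feedback: {downstream_grad}\n"
        "State what this component should change about its output, "
        "and what it is doing well that should be preserved."
    )
    return fake_llm(prompt)

# Chain rule over a two-step workflow: feedback on the final answer
# becomes a corrective signal for the retrieval step.
g_answer = "final answer cites a passage that does not support the claim"
g_retrieve = backprop_step("retrieve", "passages P1 and P2", g_answer)
```

This is the sense in which the sequential scheme carries inter-component dependencies: the upstream component never sees the raw task loss, only a gradient that has already been conditioned on what went wrong downstream.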
Circularity Check
No circularity: empirical method with independent optimization loops
full rationale
The paper formulates workflow induction as bilevel optimization (outer sketch + inner per-LLM-call textual gradient updates) and reports competitive empirical performance against human-crafted baselines. No equations or derivations are presented that reduce a claimed result to its own fitted inputs by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled in. The central claim rests on experimental comparison rather than tautological re-expression of the method itself, rendering the chain self-contained.