arxiv: 2606.11290 · v1 · pith:MGXDZVC7new · submitted 2026-06-09 · 💻 cs.LG · cs.AI · cs.CL

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Pith reviewed 2026-06-27 14:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords agentic workflows · multi-agent systems · workflow optimization · portfolio selection · query adaptive matching · LLM agents · precompute reuse

The pith

FlowBank builds a bank of complementary workflows and matches each query to the best one at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic workflow optimization currently trades off between expensive offline search for one workflow and expensive inference for a new workflow per query. The paper demonstrates that these approaches are complementary because workflows from offline search solve different subsets of queries. FlowBank addresses this by using three stages to generate diverse candidates, curate them into a compact portfolio, and match queries to the appropriate workflow using predicted utility. This precompute-and-reuse method delivers the highest average performance on five benchmarks while remaining cost-competitive.

Core claim

The central discovery is that building a compact portfolio of reusable workflows and selecting among them adaptively at inference time solves the coverage-cost tradeoff better than either committing to a single workflow or synthesizing one per query. The three-stage design produces complementary candidates through diversification, compresses them with curation, and routes queries via bipartite graph utility prediction, resulting in superior average scores across benchmarks at comparable cost.

What carries the argument

The FlowBank framework, whose Diversifying, Curating, and Matching stages together enable portfolio-based workflow selection.

If this is right

  • Offline diversification can produce workflows that cover different query subsets rather than redundant ones.
  • Curation compresses the pool into a small deployable portfolio with minimal redundancy.
  • Matching as edge-value prediction on a query-workflow graph allows low-cost assignment at inference time.
  • The portfolio approach improves the average score by 4.26% relative over the strongest automated baseline and by 14.92% relative over the strongest handcrafted baseline.
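
To make the matching step concrete, here is a minimal sketch of routing a query to the portfolio member with the best predicted utility. It is an illustration under assumptions, not the paper's matcher: the utility form (predicted score minus `lam` times predicted cost), the `Candidate` type, the workflow names, and all numbers are hypothetical. The source only states that utility reflects a performance-cost trade-off and that a cost regularization weight λ exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One portfolio member with hypothetical predictions for a single query."""
    name: str
    predicted_score: float  # assumed: predicted chance of solving the query
    predicted_cost: float   # assumed: predicted inference cost, arbitrary units


def utility(c: Candidate, lam: float) -> float:
    # Assumed trade-off: predicted performance minus a cost penalty weighted by lam.
    return c.predicted_score - lam * c.predicted_cost


def route(candidates: list[Candidate], lam: float) -> Candidate:
    # Select the portfolio member with the highest predicted utility.
    return max(candidates, key=lambda c: utility(c, lam))


# Hypothetical predictions for one incoming query.
portfolio = [
    Candidate("cheap_single_call", predicted_score=0.60, predicted_cost=1.0),
    Candidate("ensemble_with_review", predicted_score=0.80, predicted_cost=3.0),
]

print(route(portfolio, lam=0.0).name)  # cost ignored
print(route(portfolio, lam=0.2).name)  # cost penalized
```

With `lam=0.0` the more accurate but more expensive workflow wins; raising `lam` to 0.2 flips the choice to the cheaper one, which is the kind of behavior a cost regularization weight is meant to control.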

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could reduce the need for expensive query-level generation in many practical deployments.
  • If the learned matching generalizes well, the bank size could be further reduced without loss of coverage.
  • This suggests searching for workflow diversity explicitly rather than just performance.

Load-bearing premise

A small set of precomputed workflows can cover the query distribution well enough that the cost of matching does not outweigh the benefits of reuse.
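
The premise can be illustrated with a small sketch of set coverage and greedy portfolio selection. Everything here is an assumption for illustration: coverage is taken to mean the fraction of queries solved by at least one workflow in the set, the greedy rule is a standard max-coverage heuristic and not the paper's CuraFlow procedure, and the workflows `A` to `D` and their solved-query sets are made up.

```python
def coverage(solved_by: dict[str, set[int]], chosen: list[str], n_queries: int) -> float:
    # Assumed definition: fraction of queries solved by at least one chosen workflow.
    covered = set().union(*(solved_by[w] for w in chosen))
    return len(covered) / n_queries


def greedy_portfolio(solved_by: dict[str, set[int]], k: int) -> list[str]:
    # Standard greedy max-coverage heuristic: repeatedly add the workflow
    # that solves the most queries not yet covered, up to k members.
    chosen: list[str] = []
    covered: set[int] = set()
    for _ in range(k):
        remaining = [w for w in solved_by if w not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda w: len(solved_by[w] - covered))
        if not solved_by[best] - covered:
            break  # no remaining workflow adds coverage
        chosen.append(best)
        covered |= solved_by[best]
    return chosen


# Hypothetical pool: which of 6 queries (ids 0..5) each workflow solves.
solved_by = {
    "A": {0, 1, 2, 3},
    "B": {0, 1, 2},
    "C": {4, 5},
    "D": {3},
}

picked = greedy_portfolio(solved_by, k=2)
print(picked, coverage(solved_by, picked, n_queries=6))
print(["A", "B"], coverage(solved_by, ["A", "B"], n_queries=6))
```

In this toy pool the two individually strongest workflows (`A` and `B`) overlap almost entirely, while pairing `A` with the weaker but complementary `C` covers every query. That is the sense in which a small portfolio can beat the top-ranked workflows taken together.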

What would settle it

Experiments on the five benchmarks showing that FlowBank does not achieve a higher average score than the strongest baseline, or that its inference cost is not competitive with the baselines, would undercut the claim.

Figures

Figures reproduced from arXiv: 2606.11290 by Chenghao Deng, Fangxu Yu, Furong Huang, Lingzhi Yuan, Mohammad Rostami, Souradip Chakraborty.

Figure 1: FlowBank turns workflows from one-shot solutions into reusable assets. Left: Rather than committing to a single universal workflow or generating a new workflow for every query, FlowBank builds a compact portfolio of complementary workflows offline and assigns each query to the member with the best predicted utility. Right: This portfolio view recovers query-level adaptivity without the full per-query gener…
Figure 2: Coverage on MATH for workflow sets built from AFlow candidates. Potential of Reusing Workflows through Task-level Searching. Task-level workflow optimization explores many candidate workflows, but deployment ultimately uses only one. This raises a natural question: do the remaining workflows still provide reusable value as a set? To quantify this on a finite dataset D, we define the coverage of a workflo…
Figure 3: Performance–cost comparison on DROP between ScoreFlow and an oracle selector. Potential of Reducing Redundant Workflow Generation for Query Adaptation. To examine whether the extra inference cost for query adaptation is always necessary, we use ScoreFlow (Wang et al., 2025c) as a representative query-level method, which uses DPO to train a meta-generator and designs a customized workflow for each query …
Figure 4: Overview of FlowBank. DiverseFlow first builds a diverse raw pool Ω_raw through performance-oriented warm-up followed by coverage-oriented expansion. CuraFlow then selects a compact portfolio Ω* that retains most attainable coverage while pruning redundant workflows. Given a new query, a bipartite query–workflow matcher predicts each portfolio member's utility under the performance–cost trade-off and execu…
Figure 5: Performance–cost trade-off across all methods. FlowBank is on the Pareto frontier. Performance–Cost Trade-off. Beyond raw accuracy, FlowBank remains efficient. Its average inference cost is 1.65, below AFlow (GPT-4o) at 1.95 and ScoreFlow at 2.37, while its average performance is higher than both. AgentSquare is cheaper at 1.02, but it trails FlowBank by 6.70 points in average performance, leaving it off …
Figure 6: Left: Impact of portfolio size k on AMC; Right: Impact of cost regularization weight λ on MMLU Pro.
Figure 7: Workflow assignment distribution of FlowBank across datasets.
Figure 8: Coverage-k curves of CuraFlow across five benchmarks.
Figure 9: Controlled comparison of Coverage-k curves for AFlow and DiverseFlow (online cumulative), and CuraFlow (combinatorial best over DiverseFlow's full candidate set) on MATH. F.2. Broader Impact. FlowBank can have several positive practical impacts. By reusing a compact portfolio of high-utility workflows instead of generating or executing fully dynamic multi-agent systems for every query, it can substantially …
Original abstract

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FlowBank, a three-stage framework (Diversifying via DiverseFlow, Curating via CuraFlow, and Matching via bipartite graph edge-value prediction) for agentic workflow optimization. It claims that offline search yields complementary workflows that can be compressed into a compact portfolio and adaptively matched to queries at inference time, yielding the highest average performance across five benchmarks while remaining cost-competitive, with relative gains of 4.26% over the strongest automated baseline and 14.92% over handcrafted baselines.

Significance. If the empirical claims hold under proper statistical controls and ablations, the work could shift the dominant paradigms in LLM-based multi-agent systems from single-workflow or per-query synthesis toward portfolio-based precompute-and-reuse, offering a practical middle ground between offline cost and inference efficiency.

major comments (3)
  1. [Experiments section (benchmark tables)] The reported average scores and relative improvements (4.26% and 14.92%) are presented without error bars, standard deviations, or statistical significance tests across the five benchmarks; this directly undermines the central claim that FlowBank achieves the 'highest average score' among evaluated methods.
  2. [Matching stage description] No information is given on the training distribution, data split, or hyperparameter selection procedure for the edge-value prediction model on the query-workflow bipartite graph; without this, it is impossible to rule out that the matching function was tuned on the same queries used for final evaluation, weakening the performance-cost trade-off claim.
  3. [Framework overview and ablations section] The three-stage design is motivated by complementarity, yet no ablation isolates the contribution of Diversifying, Curating, or Matching individually; the central claim that the full portfolio approach is superior therefore rests on an unverified assumption that the stages are non-redundant.
minor comments (2)
  1. [Abstract and introduction] The abstract and introduction use 'complementary' and 'under-covered queries' without a precise quantitative definition or metric; adding an explicit coverage or diversity measure would improve clarity.
  2. [Figures] Figure captions for the motivating analysis and bipartite graph should explicitly state the number of workflows and queries visualized to allow readers to assess scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important ways to strengthen the empirical rigor and methodological transparency of the manuscript. We respond to each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Experiments section (benchmark tables): the reported average scores and relative improvements (4.26% and 14.92%) are presented without error bars, standard deviations, or statistical significance tests across the five benchmarks; this directly undermines the central claim that FlowBank achieves the 'highest average score' among evaluated methods.

    Authors: We agree that the absence of error bars and statistical tests weakens the strength of the average performance claims. In the revised manuscript, we will add standard deviations computed over multiple independent runs of the evaluation pipeline and include statistical significance tests (such as paired t-tests or Wilcoxon signed-rank tests) comparing FlowBank against baselines to support the reported relative gains. revision: yes

  2. Referee: Matching stage description: no information is given on the training distribution, data split, or hyperparameter selection procedure for the edge-value prediction model on the query-workflow bipartite graph; without this, it is impossible to rule out that the matching function was tuned on the same queries used for final evaluation, weakening the performance-cost trade-off claim.

    Authors: We will revise the Matching stage description to explicitly detail the training distribution (queries drawn from a held-out portion of each benchmark), the data split procedure (ensuring complete separation from the final evaluation queries), and the hyperparameter selection method (e.g., grid search or Bayesian optimization on a validation split). This will confirm that the edge-value prediction model was not tuned on evaluation data. revision: yes

  3. Referee: Framework overview and § on ablations: the three-stage design is motivated by complementarity, yet no ablation isolates the contribution of Diversifying, Curating, or Matching individually; the central claim that the full portfolio approach is superior therefore rests on an unverified assumption that the stages are non-redundant.

    Authors: We will add a new ablation subsection that systematically isolates each stage by evaluating variants with Diversifying, Curating, or Matching removed or replaced by simpler alternatives. These results will quantify the individual and synergistic contributions, directly verifying the non-redundancy of the three stages. revision: yes
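
As a rough sketch of the kind of paired comparison the first response promises, the snippet below runs an exact paired sign-flip permutation test on per-benchmark score differences. This is a swapped-in technique chosen because it needs only the standard library; it is not the paired t-test or Wilcoxon signed-rank test named in the response, and the score lists are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(a: list[float], b: list[float]) -> float:
    """Exact two-sided paired sign-flip permutation test on the mean difference."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-benchmark scores (hypothetical, NOT taken from the paper).
method_scores = [72.0, 65.0, 81.0, 58.0, 90.0]
baseline_scores = [70.0, 61.0, 80.0, 55.0, 86.0]

print(paired_sign_flip_pvalue(method_scores, baseline_scores))
```

One caveat this exposes: with only five paired values there are 2^5 = 32 sign assignments, so the smallest attainable two-sided p-value is 2/32 = 0.0625. Pairing at the level of individual queries or repeated runs, rather than benchmark averages, would presumably be needed for a test at the usual 0.05 level.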

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes an empirical three-stage framework (Diversifying, Curating, Matching) for agentic workflow optimization and reports benchmark performance gains. No mathematical derivation chain, first-principles predictions, or equations are presented in the provided text that reduce by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. The motivating analysis and complementarity claims are stated at a descriptive level without tautological reductions, and results are framed as experimental outcomes rather than derived predictions. This is a standard self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that the five benchmarks are representative of real query distributions and that the offline search procedure produces sufficiently complementary workflows; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

pith-pipeline@v0.9.1-grok · 5845 in / 1240 out tokens · 15994 ms · 2026-06-27T14:11:38.392411+00:00 · methodology

