Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Buqiang Xu; Guozhou Zheng; Ningyu Zhang; Shuofei Qiao; Yijun Wang; Yi Zhong; Zifei Shan

arXiv: 2604.19667 · v2 · pith:WOKFU2YAnew · submitted 2026-04-21 · 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.MA

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

Pith reviewed 2026-05-10 02:51 UTC · model grok-4.3

Keywords: workflow generation, natural language, large language models, agentic framework, executable workflows, visual workflows, benchmark, industrial automation

The pith

Large language models capture high-level intent but struggle to produce correct, stable, executable visual workflows from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Chat2Workflow, a benchmark built from real business cases, to test whether language models can automate the creation of visual workflows that run on industrial platforms. It finds that current models often grasp the overall goal yet fail to generate workflows that execute without errors, especially when requirements grow complex or shift over time. The authors introduce an agentic framework that reduces some errors through iterative correction and achieves modest gains in success rate. Even with this help, a large gap remains between model output and deployable automation. This positions the benchmark as a way to measure and drive progress toward reliable, low-cost workflow construction.

Core claim

While state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although the proposed agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation.

What carries the argument

The Chat2Workflow benchmark, consisting of real-world business workflow instances that can be transformed and deployed directly to platforms such as Dify and Coze, paired with an agentic framework that iteratively corrects execution errors.
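
As a rough illustration of what "iteratively corrects execution errors" could look like, the sketch below runs a generate, execute, repair loop. It is not the authors' framework: `generate`, `execute`, `ExecutionResult`, and the retry budget are hypothetical stand-ins chosen for the example, and the toy functions at the bottom only simulate a model and a workflow platform.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecutionResult:
    """Hypothetical outcome of running a workflow on a platform."""
    ok: bool
    error: Optional[str] = None


def repair_loop(
    instruction: str,
    generate: Callable[[str, Optional[str]], str],
    execute: Callable[[str], ExecutionResult],
    max_attempts: int = 3,  # assumed budget, not taken from the paper
) -> tuple[Optional[str], int]:
    """Return (workflow, attempts_used); workflow is None if every attempt failed."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Regenerate, passing the previous execution error back as feedback.
        workflow = generate(instruction, feedback)
        result = execute(workflow)
        if result.ok:
            return workflow, attempt
        feedback = result.error
    return None, max_attempts


# Toy stand-ins: the first draft fails, the repaired draft succeeds.
def fake_generate(instruction: str, feedback: Optional[str]) -> str:
    return "workflow-v2" if feedback else "workflow-v1"


def fake_execute(workflow: str) -> ExecutionResult:
    if workflow == "workflow-v1":
        return ExecutionResult(ok=False, error="node references an undefined variable")
    return ExecutionResult(ok=True)


workflow, attempts = repair_loop("Build a study planner", fake_generate, fake_execute)
print(workflow, attempts)
```

The loop only addresses errors that surface at execution time; a workflow that runs cleanly but does not satisfy the instruction would pass through unchanged, which is consistent with the page's point that iteration gives partial relief rather than closing the gap.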

If this is right

Improved techniques will be required to handle multi-round revisions and stability checks for complex workflows.
The benchmark supplies a concrete evaluation set for measuring progress in automated workflow generation.
Agentic iteration provides partial relief from execution errors but does not close the gap for industrial use.
Development of visual workflows will remain manual and costly until models improve on the identified failure modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could accelerate specialized model training for workflow tasks in the same way coding benchmarks advanced code generation.
Similar evaluation setups might apply to other automation areas such as robotic process automation or business process modeling.
Persistent execution gaps point to the need for tighter integration between language models and workflow execution environments during generation.

Load-bearing premise

That the collected real-world business workflows are representative of practical industrial needs, and that generated workflows can be transformed and directly deployed to platforms such as Dify and Coze without loss of intended functionality.

What would settle it

An LLM or agentic system that achieves a high resolve rate on complex or changing-requirement instances in Chat2Workflow, with the resulting workflows executing correctly and stably after deployment to Dify or Coze.
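
To make the settling condition concrete, here is a minimal sketch of how per-round pass and resolve rates might be tabulated from evaluation records. The record layout, the task name `InvoiceParser`, and the reading of each rate as a simple fraction of instances are assumptions made for illustration; the paper's exact definitions are not given on this page.

```python
from collections import defaultdict

# Hypothetical records: (task, dialogue_round, passed, resolved)
records = [
    ("StudyPlanner", 1, True, True),
    ("StudyPlanner", 2, True, False),
    ("InvoiceParser", 1, True, True),
    ("InvoiceParser", 2, False, False),
]


def rates_by_round(records):
    """Aggregate assumed pass/resolve flags into per-round fractions."""
    totals = defaultdict(lambda: [0, 0, 0])  # round -> [count, passed, resolved]
    for _task, rnd, passed, resolved in records:
        bucket = totals[rnd]
        bucket[0] += 1
        bucket[1] += int(passed)
        bucket[2] += int(resolved)
    return {
        rnd: {"pass_rate": p / n, "resolve_rate": r / n}
        for rnd, (n, p, r) in sorted(totals.items())
    }


for rnd, rates in rates_by_round(records).items():
    print(rnd, rates)
```

Splitting the rates by round is what would expose the changing-requirement case: a system that settles the question would need high resolve rates in later rounds, not only in the first.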

Figures

Figures reproduced from arXiv: 2604.19667 by Buqiang Xu, Guozhou Zheng, Ningyu Zhang, Shuofei Qiao, Yijun Wang, Yi Zhong, Zifei Shan.

**Figure 1.** An example task in Chat2Workflow, which features realistic, variable natural-language instruction inputs and produces outputs that can be directly transformed and integrated into real-world workflow platforms (e.g., Dify and Coze).

**Figure 2.** Distribution of task types in Chat2Workflow.

**Figure 3.** Overview of Chat2Workflow benchmark construction and evaluation framework.

**Figure 4.** Performance degradation across dialogue rounds. We show the Pass Rate and Resolve Rate for all 15 …

**Figure 5.** Bad case analysis for the StudyPlanner task. We compare outputs from three representative models: …

**Figure 6.** Prompt for evaluating pass rate.

**Figure 7.** Prompt for evaluating resolve rate.

**Figure 8.** The Dify workflow generated by GPT-5.2 in the second round of the Studyplanner task.

**Figure 9.** The Coze workflow generated by GPT-5.2 in the second round of the Studyplanner task.

Original abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a new benchmark for natural language to executable visual workflow generation drawn from real business cases, but the abstract gives too few details on size, metrics, and validation to judge how well the results hold up.

The paper's main contribution is Chat2Workflow, a benchmark built from real-world business workflows for testing whether LLMs can produce deployable visual workflows from natural language, plus an agentic framework that improves resolve rates by up to 5.34%. It also releases the code publicly. This targets a clear industrial pain point where workflows are still mostly hand-crafted for platforms like Dify and Coze, and the motivation to move beyond manual engineering is straightforward and relevant.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Chat2Workflow benchmark for generating executable visual workflows from natural language, constructed from real-world business workflows deployable to platforms like Dify and Coze. It demonstrates that state-of-the-art LLMs struggle to produce correct, stable, and executable workflows, particularly with complex or evolving requirements, and proposes an agentic framework achieving up to 5.34% improvement in resolve rate, while noting a persistent gap for industrial applications. The code is made publicly available.

Significance. If the benchmark is representative of industrial workflows and the evaluation is rigorous with verifiable lossless deployment, this work would be significant for advancing research in LLM-based automation of visual workflows. It provides a new benchmark targeting a practical gap in manual workflow engineering and highlights limitations of current models, with the open-sourced code supporting reproducibility and community extension.

major comments (2)

[Abstract] The central claim regarding LLM struggles and the 'remaining real-world gap' depends on the benchmark being representative of practical industrial workflows and generated outputs being transformable without loss of functionality. However, the abstract provides no details on the collection protocol, benchmark size, diversity statistics, coverage of requirement changes, or empirical deployment success rates to Dify and Coze.
[§5 (Experiments)] The reported 5.34% resolve rate gains and 'concrete struggles' are presented without specifying the benchmark size, exact definition and computation of resolve rate, evaluation protocol, or categorization of errors. This prevents a robust assessment of whether the agentic framework's improvements and the identified limitations are statistically meaningful and generalizable.
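
The concern about statistical meaningfulness can be made tangible with a back-of-the-envelope calculation. The sketch below computes an approximate 95% Wald interval for a difference in resolve rates at several hypothetical benchmark sizes. The baseline rate of 0.40, the sizes, and the reading of 5.34% as an absolute gain in percentage points are all assumptions for illustration; none of them comes from the paper.

```python
import math


def diff_ci(p_base: float, p_new: float, n: int, z: float = 1.96):
    """Approximate 95% Wald interval for p_new - p_base.

    Treats the two runs as independent samples of size n, which is a
    simplification: runs on the same tasks are paired.
    """
    diff = p_new - p_base
    se = math.sqrt(p_base * (1 - p_base) / n + p_new * (1 - p_new) / n)
    return diff - z * se, diff + z * se


# Hypothetical baseline of 0.40 and an assumed absolute gain of 0.0534.
for n in (100, 500, 2000):
    low, high = diff_ci(0.40, 0.4534, n)
    verdict = "excludes 0" if low > 0 else "includes 0"
    print(f"n={n}: [{low:.3f}, {high:.3f}] {verdict}")
```

Under these assumptions the interval only clears zero at the largest size, which is why the benchmark size matters for interpreting the reported gain. Because both systems would be scored on the same instances, a paired analysis such as McNemar's test or a paired bootstrap would be the more appropriate tool in practice.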

minor comments (1)

[Abstract] Clarification on the definition of 'resolve rate' and how the agentic framework is implemented would improve the abstract's standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each of the major comments below and commit to revising the manuscript accordingly to enhance clarity and provide the requested details.

Referee: [Abstract] The central claim regarding LLM struggles and the 'remaining real-world gap' depends on the benchmark being representative of practical industrial workflows and generated outputs being transformable without loss of functionality. However, the abstract provides no details on the collection protocol, benchmark size, diversity statistics, coverage of requirement changes, or empirical deployment success rates to Dify and Coze.

Authors: We agree that the abstract would benefit from additional context to substantiate the central claims. Although the manuscript body (Sections 3 and 4) provides comprehensive details on the benchmark construction from real-world workflows, the collection protocol, size, diversity, and deployment verification, we will revise the abstract to concisely incorporate key statistics and assurances regarding representativeness and lossless transformation to platforms like Dify and Coze.

Revision: yes

Referee: [§5 (Experiments)] The reported 5.34% resolve rate gains and 'concrete struggles' are presented without specifying the benchmark size, exact definition and computation of resolve rate, evaluation protocol, or categorization of errors. This prevents a robust assessment of whether the agentic framework's improvements and the identified limitations are statistically meaningful and generalizable.

Authors: We acknowledge the need for greater explicitness in the experimental section. In the revised manuscript, we will update §5 to clearly specify the benchmark size, provide the exact definition and computation method for the resolve rate, outline the full evaluation protocol, and detail the error categorization scheme. This will allow readers to better assess the statistical significance and generalizability of the results, including the 5.34% gains from the agentic framework.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical benchmark constructed from a collection of external real-world business workflows, with public code release, and reports experimental results on LLM performance plus modest gains from an agentic framework. No equations, fitted parameters renamed as predictions, self-definitional claims, or load-bearing self-citations appear in the abstract or described structure; the central claims about LLM limitations and remaining gaps rest on the benchmark instances themselves rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the assumption that real-world workflows can be collected and transformed into deployable benchmark instances; no free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: Real-world business workflows can be systematically collected and converted into a format that allows direct deployment to practical platforms such as Dify and Coze.
This underpins the benchmark's relevance to industry and the claim that generated workflows are executable in practice.
