CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Dongsheng Ma; Feiyu Xiong; Rui Ling; Sizhe Wang; Wentao Zhang; Yongan Yu; Zhengren Wang; Zhiyu Li

arxiv: 2504.21751 · v4 · submitted 2025-04-30 · 💻 cs.SE · cs.CL

Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang

Pith reviewed 2026-05-22 17:37 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords code generation · large language models · benchmarks · multi-turn · iterative development · code reuse · dependency trees · software engineering

The pith

Large language models show sharp performance drops when generating code iteratively over multiple turns, worsening with greater dependency complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeFlowBench to test how well LLMs handle codeflow, the process of adding new features by reusing and extending existing code across successive turns rather than writing standalone snippets. It builds two sets of test cases, one from thousands of competitive programming problems and another drawn from GitHub repositories, then applies a dual assessment protocol together with structural metrics extracted from dependency trees. Experiments on current models document clear degradation as the number of turns increases and show that success rates fall as the web of function dependencies grows denser. This setup matters because everyday software work consists of ongoing refinement and reuse inside larger codebases instead of isolated generation tasks.

Core claim

We formalize the iterative, multi-turn paradigm of code development as codeflow and introduce CodeFlowBench to evaluate LLMs' ability to implement new functionality by reusing existing functions over multiple turns. The benchmark comprises CodeFlowBench-Comp, a collection of over 5,000 competitive programming problems, and CodeFlowBench-Repo, sourced from GitHub repositories. A novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees is presented. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios and that model performance inversely correlates with dependency complexity.
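To make the multi-turn setting concrete, here is a minimal sketch of a turn-by-turn evaluation loop. It is an illustration under stated assumptions, not the paper's harness: the `Turn` schema, `run_codeflow`, the toy `canned_model`, and the choice to carry the model's own earlier output into later turns are all hypothetical, and the paper's dual assessment protocol is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    """One subproblem: a function to implement plus checks on its behavior (hypothetical schema)."""
    name: str
    description: str
    tests: list  # list of (args tuple, expected result) pairs


def run_codeflow(turns: list, generate: Callable[[str, str], str]) -> list:
    """Run turns in order; each turn sees the code produced in earlier turns.

    `generate(description, prior_code)` stands in for a model call and returns source code.
    Returns one pass/fail flag per turn.
    """
    prior_code = ""
    results = []
    for turn in turns:
        candidate = generate(turn.description, prior_code)
        namespace: dict = {}
        try:
            # Execute earlier code plus the new function in a fresh namespace.
            exec(prior_code + "\n" + candidate, namespace)
            func = namespace[turn.name]
            passed = all(func(*args) == expected for args, expected in turn.tests)
        except Exception:
            passed = False
        results.append(passed)
        # Assumption: the candidate is carried forward whether or not it passed,
        # so an early mistake can propagate into later turns.
        prior_code += "\n" + candidate
    return results


# Toy stand-in for a model: canned answers keyed by the function name in the description.
CANNED = {
    "square": "def square(x):\n    return x * x\n",
    "sum_of_squares": "def sum_of_squares(xs):\n    return sum(square(x) for x in xs)\n",
}


def canned_model(description: str, prior_code: str) -> str:
    return CANNED[description.split(":", 1)[0]]


turns = [
    Turn("square", "square: return x squared", [((3,), 9)]),
    Turn("sum_of_squares", "sum_of_squares: reuse square to sum the squares of a list", [(([1, 2, 3],), 14)]),
]

print(run_codeflow(turns, canned_model))
```

Because the second function calls the first, a wrong `square` in turn one would also fail turn two, which is the kind of error propagation a multi-turn benchmark can expose. Running untrusted model output with `exec` would need sandboxing in any real setup.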

What carries the argument

Codeflow, the iterative multi-turn process of implementing new functionality through reuse of existing functions, evaluated via dependency-tree structural metrics and a dual assessment protocol.
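A paper passage quoted under the Lean theorem links further down this page describes parsing the AST of a verified solution and topologically sorting function dependencies. The sketch below illustrates that idea with Python's standard `ast` and `graphlib` modules. It is not the authors' pipeline: it assumes Python source, top-level functions only, and direct calls by name. The helpers `function_dependencies`, `turn_order`, and `depth`, the sample `SOLUTION`, and the convention that a leaf function has depth 1 are all hypothetical.

```python
import ast
from graphlib import TopologicalSorter

# Hypothetical sample solution: `solve` reuses `count_pairs`, which reuses `gcd`.
SOLUTION = '''
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

def count_pairs(nums):
    return sum(1 for i, x in enumerate(nums) for y in nums[i + 1:] if gcd(x, y) == 1)

def solve(nums):
    return count_pairs(nums) % 1_000_000_007
'''


def function_dependencies(source: str) -> dict:
    """Map each top-level function to the other top-level functions it calls by name."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    deps = {}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        called = {
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
        }
        # Keep only calls to functions defined in the same file; ignore self-recursion.
        deps[node.name] = (called & defined) - {node.name}
    return deps


def turn_order(deps: dict) -> list:
    """Order functions so that every dependency comes before its caller."""
    return list(TopologicalSorter(deps).static_order())


def depth(name: str, deps: dict) -> int:
    """Depth in the dependency tree; assumed convention: no dependencies means depth 1."""
    return 1 + max((depth(d, deps) for d in deps[name]), default=0)


deps = function_dependencies(SOLUTION)
print(turn_order(deps))
print(depth("solve", deps))
```

The topological order gives one plausible sequence of turns (dependencies first), and the depth of the final function is a simple proxy for how deep the reuse chain runs. Mutually recursive functions would form a cycle, which `TopologicalSorter` rejects with `CycleError` and which this `depth` helper does not handle.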

If this is right

LLMs require stronger mechanisms for tracking and reusing prior code artifacts across successive interactions.
Standard single-turn code generation benchmarks underestimate the difficulty of realistic development sequences.
Structural metrics based on dependency trees expose failure modes that functional correctness alone misses.
Progress on code generation will depend on addressing the inverse relationship between dependency complexity and model accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams using LLMs for coding assistance may achieve better results by deliberately limiting the depth of inter-function dependencies within any single task.
Future model training could incorporate synthetic multi-turn dialogues that explicitly build and reference prior code structures.
The benchmark framework could be extended to measure how well agents maintain consistency when humans intervene between turns.

Load-bearing premise

The chosen competitive programming problems and GitHub repositories, together with the dual assessment protocol and dependency-tree metrics, provide a faithful proxy for real-world iterative code development workflows.

What would settle it

Finding that current models maintain or improve their success rates when tested on real developer commit histories with comparable dependency-tree depths and turn counts would undermine the reported performance degradation.
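One way to set up such a comparison is to stratify outcomes by turn count and dependency depth before comparing pass rates, so that a difference in problem mix is not mistaken for a turn effect. The following is a minimal sketch, assuming a hypothetical record format with `turns`, `depth`, and `passed` fields; the sample records are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def pass_rate_by(records, *keys):
    """Pass rate per stratum, where a stratum is the tuple of values for `keys`."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [passed, attempted]
    for rec in records:
        bucket = totals[tuple(rec[k] for k in keys)]
        bucket[0] += int(rec["passed"])
        bucket[1] += 1
    return {k: passed / attempted for k, (passed, attempted) in sorted(totals.items())}


# Made-up placeholder records for illustration only.
records = [
    {"turns": 1, "depth": 1, "passed": True},
    {"turns": 1, "depth": 1, "passed": True},
    {"turns": 2, "depth": 2, "passed": True},
    {"turns": 2, "depth": 2, "passed": False},
    {"turns": 3, "depth": 2, "passed": False},
    {"turns": 3, "depth": 3, "passed": False},
]

print(pass_rate_by(records, "turns"))
print(pass_rate_by(records, "turns", "depth"))
```

Grouping on both fields at once keeps problems with the same turn count but different depths in separate strata, which is the comparison the paragraph above calls for.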

Figures

Figures reproduced from arXiv: 2504.21751 by Dongsheng Ma, Feiyu Xiong, Rui Ling, Sizhe Wang, Wentao Zhang, Yongan Yu, Zhengren Wang, Zhiyu Li.

**Figure 2.** An illustrative example from CodeFlowBench

**Figure 3.** The data curation pipeline of CodeFlowBench. In...

**Figure 4.** Models’ Pass@1 results on multi-turn problems grouped by model categories and turn number.

**Figure 5.** An example page of problems on Codeforces, which contains...

**Figure 6.** An example of original coding problem we obtained in stage I. To make the content more clear, we remove...

**Figure 7.** Subfigure (a) is an example of using URL anchor identification technique. The anchor here is each subtitle...

**Figure 8.** The scraped and processed solution we obtained in stage II. The original problem is...

**Figure 9.** The prompt template used for code conversion in stage III. The whole content of the example output data is...

**Figure 10.** The prompt template used for generating natural language description for each subproblem in stage III.

**Figure 11.** An example of subproblems we obtained in stage IV. The solution code of 1946E contains a...

**Figure 12.** Statistics of the overall-turns and overall-depth metrics in CodeFlowBench-Comp. Subfigure (b) shows...

**Figure 13.** An example problem in DomainEval. We use the "method_code" attribute to generate subproblem...

**Figure 14.** An example problem in CodeFlowBench-Repo, which has a shared form of problems in CodeFlowBench...

**Figure 15.** Statistics of CodeFlowBench-Repo. Left: The diversity of domains, with a focus on System and...

**Figure 16.** Heatmap of models’ pass@1 scores on multi-turn problems within different DSC intervals.

**Figure 17.** Example of an Incomplete Reasoning (IR) Error by Deepseek-V3. The original problem is...

**Figure 18.** Example 1 of an Insufficient Globalization (IG) error by Deepseek-V3. The original problem is...

**Figure 19.** Example 2 of an Insufficient Globalization (IG) error by Deepseek-V3. The original problem is...

**Figure 20.** Example of an Instruction Misinterpretation (IM) error by Deepseek-V3. The original problem is...

Original abstract

Modern software development demands code that is maintainable, testable, and scalable by organizing the implementation into modular components with iterative reuse of existing codes. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow - implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of 5,000+ competitive programming problems from Codeforces updated via an automated pipeline and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. Furthermore, a novel evaluation framework featured dual assessment protocol and structural metrics derived from dependency trees is introduced. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios. Furthermore, our in-depth analysis illustrates that model performance inversely correlates with dependency complexity. These findings not only highlight the critical challenges for supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CodeFlowBench as the first benchmark for evaluating LLMs on 'codeflow'—iterative, multi-turn code generation that reuses existing functions to implement new functionality. It comprises CodeFlowBench-Comp (5,000+ competitive programming problems from Codeforces, updated via an automated pipeline) and CodeFlowBench-Repo (sourced from GitHub repositories). A dual assessment protocol and structural metrics derived from dependency trees are proposed. The manuscript reports that extensive experiments show significant performance degradation in multi-turn scenarios and that model performance inversely correlates with dependency complexity.

Significance. If the experimental results are substantiated with full methodological details, the benchmark would be a useful addition to code generation research by shifting evaluation toward realistic iterative workflows involving code reuse and maintainability. The dependency-tree metrics and dual assessment protocol offer a concrete way to quantify complexity and quality beyond single-turn pass rates.

major comments (1)

[Abstract] Abstract: The central claims that 'extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios' and that 'model performance inversely correlates with dependency complexity' are presented without any information on the models evaluated, baselines, number of turns, simulation of multi-turn interactions, dependency-tree construction, statistical tests, data splits, or controls for problem difficulty. This absence makes it impossible to assess whether the reported trends support the claims or arise from artifacts in problem selection.

minor comments (1)

[Abstract] Abstract: The notation '5,000+' is imprecise; reporting an exact count or a clear range would improve reproducibility and clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the abstract. We address the major comment below and indicate where revisions will be made to strengthen the presentation of our claims.

Point-by-point responses

Referee: [Abstract] Abstract: The central claims that 'extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios' and that 'model performance inversely correlates with dependency complexity' are presented without any information on the models evaluated, baselines, number of turns, simulation of multi-turn interactions, dependency-tree construction, statistical tests, data splits, or controls for problem difficulty. This absence makes it impossible to assess whether the reported trends support the claims or arise from artifacts in problem selection.

Authors: We acknowledge that the abstract, due to space constraints, presents the claims at a high level without enumerating methodological specifics. The full manuscript elaborates these elements in the Experiments and Evaluation sections, including the LLMs tested, comparison baselines, the iterative multi-turn protocol, construction of dependency trees from code structure, statistical analysis methods, train/test splits, and stratification by problem complexity. To make the abstract more self-contained while remaining concise, we will add a brief clause referencing the evaluation protocol and complexity controls.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

full rationale

This is an empirical benchmark paper that introduces CodeFlowBench (CodeFlowBench-Comp and CodeFlowBench-Repo) and reports experimental results on LLM performance. The abstract describes formalizing 'codeflow', collecting problems/repositories, applying a dual assessment protocol and dependency-tree metrics, then measuring degradation and inverse correlation with complexity. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps exist in the provided text. All central claims are direct observations on the new benchmark rather than reductions to inputs by construction, satisfying the self-contained empirical case with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the benchmark construction itself constitutes the primary addition.

pith-pipeline@v0.9.0 · 5702 in / 1097 out tokens · 54373 ms · 2026-05-22T17:37:12.548794+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We parse the AST of the verified solution to extract and topologically sort function dependencies... define Average Pass Depth (APD) and Dependency Structure Complexity (DSC)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios... model performance inversely correlates with dependency complexity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
cs.AI · 2026-05 · unverdicted · novelty 7.0

MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
cs.AI · 2026-05 · unverdicted · novelty 6.0

MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

Context Learning for Multi-Agent Discussion
cs.AI · 2026-02 · unverdicted · novelty 6.0

M2CL trains per-agent context generators with a self-adaptive mechanism to maintain coherence and reduce output discrepancies in multi-LLM discussions, yielding 20-50% gains on reasoning, embodied, and mobile control tasks.

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
cs.SE · 2026-05 · unverdicted · novelty 5.0

A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

Code reuse in practice: Benefiting or harming technical debt. Journal of Systems and Software, 167:110618. Google. 2025. Gemini-3-flash system card. System card, Google. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The ll...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Measuring Coding Challenge Competence With APPS

Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938. Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. 2024. Effibench: Benchmarking the efficiency of automatically generated code. Advances in Neural Information Processing Systems, 37:11506–11544. Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang

Maintaincoder: Maintainable code generation under dynamic requirements. arXiv preprint arXiv:2503.24260. Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct. arXiv preprint arXiv:2312.02120. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang ...

work page arXiv 2023
[5]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. 2023. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36:23826–23854. Zhaojian Yu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang. 2024. Humane...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Clarity: Is the statement smoothly and understandably expressed, allowing a reader to grasp the task goal quickly?

work page
[7]

Completeness: Does the description include all key elements needed to accomplish the subtask?

work page
[8]

Accuracy: Is the description free of ambiguity or logical errors, and does it match the problem requirements and the provided code?

work page
[9]

Feasibility: Can an engineer unambiguously determine and implement the required functionality—and pass functional tests—based solely on this description?

work page
[10]

Please do not be overly strict

Professionalism: Does it use accurate, domain-appropriate terminology and a style fitting technical norms of coding tasks? Return your result **only** as a JSON dictionary with these five keys and values of 0 or 1. Please do not be overly strict. Assign a score of 1 to a criterion if it is even partially satisfied, allowing for minor imperfections. Only ...

work page
[11]

(a) Distributions of overall-turns and overall-depth

Authenticity across Specialized Domains. (a) Distributions of overall-turns and overall-depth. (b) The Correlations with Rating Levels. Figure 12: Statistics of the overall-turns and overall-depth metrics in CodeFlowBench-Comp. Subfigure (b) shows inflection points at turns = 1 and depth = 1. This is attributed to the fact that competition-level proble...

work page
[12]

method_name

Alignment with Constructive Codeflow Paradigm. While many repository-level benchmarks focus on maintenance tasks such as issue resolution or debugging within existing codebases, CodeFlowBench aims to evaluate the constructive aspect of software engineering, namely to build complex functionality from the ground up. DomainEval focuses on function generati...

work page
