CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, An Fu · 2025 · arXiv 2503.22688

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Regression Accumulation in Multi-Turn LLM Programming Conversations

cs.SE · 2026-07-02 · conditional · novelty 7.0

Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.

citing papers explorer

Regression Accumulation in Multi-Turn LLM Programming Conversations
cs.SE · 2026-07-02 · conditional · none · ref 47

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues
cs.SE · 2026-06-24 · unverdicted · none · ref 15

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation
cs.SE · 2026-07-01 · unverdicted · none · ref 32
