CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
3 Pith papers cite this work.
Fields: cs.SE (3)
Years: 2026 (3)
Citing papers
- Regression Accumulation in Multi-Turn LLM Programming Conversations
  Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.
- CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues
  CodeChat-Eval shows LLMs lose 19.2% to 69.2% of functional correctness over multi-turn refinement dialogues, with the largest drops on logic-level and additive changes.
- ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation
  ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.