ConvCodeWorld: Benchmarking conversational code generation in reproducible feedback environments, 2025

Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He · 2025 · arXiv 2502.19852

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.

citing papers explorer

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues · cs.SE · 2026-06-24 · unverdicted · none · ref 17
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns · cs.SE · 2026-06-17 · unverdicted · none · ref 17
ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation · cs.SE · 2026-07-01 · unverdicted · none · ref 12
