StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
hub Mixed citations
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Mixed citation behavior. Most common role is background (45%).
abstract
We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.
A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.
CodeFlowBench is a new benchmark with 5000+ problems and GitHub-sourced repos that evaluates LLMs on multi-turn code reuse using dependency-tree structural metrics, revealing performance drops as complexity rises.
CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.
PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt optimization that beats hand-written prompts.
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
A pipeline produces 54,000 execution-trace-verified bi-directional Chain-of-Thought rationales for code, and fine-tuning on them yields gains up to 26.6 points on LiveCodeBench-Exec and similar benchmarks.
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
citing papers explorer
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.