hub Mixed citations

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

· 2024 · cs.SE · arXiv 2401.03065

Mixed citation behavior. Most common role is background (45%).

33 Pith papers citing it

Background 45% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 5 method 1

citation-polarity summary

background 5 use dataset 3 unclear 2 use method 1

representative citing papers

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

cs.SE · 2026-04-23 · conditional · novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

An Iterative Test-and-Repair Framework for Competitive Code Generation

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

cs.PL · 2026-03-15 · unverdicted · novelty 7.0

s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

cs.SE · 2025-10-17 · unverdicted · novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

cs.SE · 2025-04-30 · unverdicted · novelty 7.0

CodeFlowBench is a new benchmark with 5000+ problems and GitHub-sourced repos that evaluates LLMs on multi-turn code reuse using dependency-tree structural metrics, revealing performance drops as complexity rises.

CodeMind: Evaluating Large Language Models for Code Reasoning

cs.SE · 2024-02-15 · unverdicted · novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

Teaching LLMs Program Semantics via Symbolic Execution Traces

cs.SE · 2026-05-07 · unverdicted · novelty 6.0

Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.

Hypothesis generation and updating in large language models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

cs.SE · 2026-04-28 · unverdicted · novelty 6.0

CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.

PrismaDV: Automated Task-Aware Data Unit Test Generation

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt optimization that beats hand-written prompts.

InCoder-32B-Thinking: Industrial Code World Model for Thinking

cs.AR · 2026-04-03 · unverdicted · novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

cs.LG · 2025-12-10 · conditional · novelty 6.0

LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

Generating Verifiable Chain of Thoughts from Exection-Traces

cs.SE · 2025-11-28 · unverdicted · novelty 6.0

A pipeline produces 54,000 execution-trace-verified bi-directional Chain-of-Thought rationales for code, and fine-tuning on them yields gains up to 26.6 points on LiveCodeBench-Exec and similar benchmarks.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

cs.CL · 2025-07-20 · unverdicted · novelty 6.0

PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

citing papers explorer

Showing 1 of 1 citing paper after filters.

InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 14 · internal anchor
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer