pith. machine review for the scientific record.

arxiv: 2401.03065 · v1 · submitted 2024-01-05 · 💻 cs.SE · cs.AI · cs.LG

Recognition: 1 theorem link

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Armando Solar-Lezama, Baptiste Rozière, Gabriel Synnaeve, Hugh Leather, Sida I. Wang

Pith reviewed 2026-05-14 20:53 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.LG
keywords CRUXEval, code reasoning, code execution, benchmark, input prediction, output prediction, large language models, Python

The pith

CRUXEval benchmark reveals GPT-4 with chain-of-thought reaches only 75% and 81% on input and output prediction for short Python functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CRUXEval, a benchmark of 800 short Python functions each paired with an input-output example, to test models on input prediction and output prediction tasks. It proposes a generic generation method for such benchmarks and evaluates twenty code models, finding that many recent high performers on HumanEval do not show comparable gains here. Simple chain-of-thought prompting and fine-tuning improve results but leave models far from solving the tasks, with GPT-4 plus CoT at 75% and 81% pass@1 versus 50% and 46% for Code Llama 34B. A sympathetic reader would care because the results suggest that current code models still lack reliable reasoning about what a program actually does when run.

Core claim

CRUXEval consists of 800 Python functions of 3-13 lines each with an associated input-output pair, defining two tasks of input prediction and output prediction. A recipe generates the benchmark, twenty models are tested showing limited transfer from HumanEval success, and CoT plus fine-tuning raise performance without closing the gap, as GPT-4 with CoT reaches 75% and 81% pass@1 on the two tasks while Code Llama 34B reaches 50% and 46%.

What carries the argument

The CRUXEval benchmark of 800 generated short Python functions each equipped with one input-output pair, supporting separate input-prediction and output-prediction tasks.
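
The two tasks can be sketched on a toy item. This is a minimal illustration, not the paper's harness: the function `f`, its input-output pair, and the helper names `check_output_prediction` and `check_input_prediction` are all hypothetical. It also assumes that, for input prediction, any input reproducing the stated output counts as correct.

```python
# Hypothetical benchmark item: this function and its input-output pair are
# invented for illustration and are not taken from CRUXEval.
def f(text, ch):
    # Count occurrences of ch, then strip ch from both ends of text.
    count = text.count(ch)
    return text.strip(ch) + str(count)


GIVEN_INPUT = ("xxhixx", "x")
GIVEN_OUTPUT = "hi4"


def check_output_prediction(predicted_output):
    """Output prediction: the model sees f and the input, and guesses the output."""
    return f(*GIVEN_INPUT) == predicted_output


def check_input_prediction(predicted_input):
    """Input prediction: the model sees f and the output, and guesses an input.

    Assumption: any input that reproduces the output is accepted, so the
    guess is scored by executing f rather than by string comparison.
    """
    try:
        return f(*predicted_input) == GIVEN_OUTPUT
    except Exception:
        # A guess that crashes f is treated as wrong.
        return False


print(check_output_prediction("hi4"))           # True
print(check_input_prediction(("hixxxx", "x")))  # True: a different but valid input
print(check_input_prediction(("hi", "x")))      # False: f returns "hi0"
```

Scoring input prediction by execution means a model is not penalized for finding a valid input other than the stored one.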

If this is right

  • Models that score highest on HumanEval do not automatically excel at basic code execution reasoning measured by input and output prediction.
  • Chain-of-thought prompting and targeted fine-tuning raise accuracy on the benchmark yet leave a large unsolved remainder.
  • Closed-source models maintain a clear lead over open-source ones such as Code Llama 34B on these reasoning tasks.
  • Persistent failures by GPT-4 on simple programs point to concrete, repeatable weaknesses in simulating execution.
  • No current model approaches perfect performance, so the benchmark remains open for future improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • CRUXEval could be used as an auxiliary training objective to push models toward better internal simulation of code runs.
  • Performance on the benchmark may predict how reliably a model can assist with debugging or tracing execution in practice.
  • The generation recipe allows easy creation of harder variants with longer functions or additional languages to test scaling limits.

Load-bearing premise

That the 800 generated functions and their input-output pairs sufficiently represent the range of code reasoning and execution challenges encountered in practical programming.

What would settle it

A new model that scores above 95% pass@1 on both CRUXEval tasks while still producing frequent execution errors on unmodified real-world Python code would falsify the claim that the benchmark measures meaningful gaps in code reasoning.

read the original abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces CRUXEval, a benchmark consisting of 800 synthetically generated Python functions (3-13 lines each) paired with input-output examples. It defines two tasks—input prediction and output prediction—provides a generic generation recipe, evaluates twenty code models (reporting pass@1 scores), shows that recent high HumanEval performers do not exhibit comparable gains, and demonstrates that chain-of-thought prompting and fine-tuning yield improvements yet leave the benchmark far from solved, with GPT-4+CoT reaching 75%/81% pass@1 versus 50%/46% for Code Llama 34B.

Significance. If the benchmark's functions adequately sample code reasoning demands, the work is significant for supplying a reproducible, extensible evaluation axis that distinguishes execution reasoning from generation-only benchmarks like HumanEval. The direct model comparisons, the generic recipe, and the concrete failure examples provide actionable data for model developers and a template for future benchmark variants.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The generation recipe produces functions restricted to 3-13 lines, yet the manuscript reports no quantitative metrics on control-flow complexity (e.g., cyclomatic complexity), recursion depth, data-structure variety, or edge-case density. Without these statistics, it is difficult to verify that the observed performance gaps (GPT-4+CoT vs. Code Llama 34B) reflect general code-reasoning limitations rather than artifacts of the synthetic distribution.
  2. [§4.2] §4.2 (Model Evaluation): The central claim that 'many recent high-scoring models on HumanEval do not show the same improvements' rests on the reported pass@1 numbers; however, the manuscript does not include a correlation analysis or scatter plot of HumanEval vs. CRUXEval scores across the twenty models, which would strengthen or qualify the claim that the two benchmarks measure distinct capabilities.
minor comments (3)
  1. [§3.1] §3.1: The description of the input-prediction task should explicitly state how the 'input' is sampled when multiple valid inputs exist for a given output.
  2. [Table 1] Table 1 (or equivalent results table): Add standard deviations or confidence intervals for the pass@1 scores to allow readers to assess the stability of the reported gaps.
  3. [§5] §5 (Failure Analysis): The provided GPT-4 failure examples are useful; consider adding a short taxonomy of the most frequent error types (e.g., off-by-one, type confusion) with counts.
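
One way the complexity statistics requested in major comment 1 could be gathered is with Python's standard `ast` module. The sketch below is a rough approximation of cyclomatic complexity, not a reference implementation: the set of node types counted as decision points is an assumption, and the function name `approx_cyclomatic_complexity` and the `EXAMPLE` snippet are hypothetical.

```python
import ast

# Node types treated as decision points. This selection is an assumption;
# dedicated complexity tools may count differently.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.While,
    ast.IfExp,
    ast.ExceptHandler,
    ast.comprehension,
)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of an and/or chain adds one more path.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical short function, written for this example only.
EXAMPLE = '''
def f(nums):
    total = 0
    for n in nums:
        if n > 0 and n % 2 == 0:
            total += n
    return total
'''

print(approx_cyclomatic_complexity(EXAMPLE))  # 4: base 1 + for + if + one "and"
```

Running such a counter over all benchmark functions would give the average and distribution that the comment asks for.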

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the benchmark and results.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The generation recipe produces functions restricted to 3-13 lines, yet the manuscript reports no quantitative metrics on control-flow complexity (e.g., cyclomatic complexity), recursion depth, data-structure variety, or edge-case density. Without these statistics, it is difficult to verify that the observed performance gaps (GPT-4+CoT vs. Code Llama 34B) reflect general code-reasoning limitations rather than artifacts of the synthetic distribution.

    Authors: We agree that reporting quantitative metrics on control-flow complexity and related properties would help readers assess the benchmark's coverage and strengthen the interpretation of the results. In the revised manuscript, we will add these statistics to §3, including average and distribution of cyclomatic complexity, recursion depth, data-structure variety (e.g., lists, dicts, sets), and edge-case density across the 800 functions. We will also briefly discuss how the generation recipe was designed to promote diversity in these dimensions. revision: yes

  2. Referee: [§4.2] §4.2 (Model Evaluation): The central claim that 'many recent high-scoring models on HumanEval do not show the same improvements' rests on the reported pass@1 numbers; however, the manuscript does not include a correlation analysis or scatter plot of HumanEval vs. CRUXEval scores across the twenty models, which would strengthen or qualify the claim that the two benchmarks measure distinct capabilities.

    Authors: We appreciate this suggestion to make the distinction between benchmarks more rigorous. In the revised §4.2, we will add a scatter plot comparing HumanEval pass@1 scores against CRUXEval input- and output-prediction scores for all twenty models, along with the Pearson correlation coefficient. This will provide quantitative support for the claim that high HumanEval performance does not necessarily translate to CRUXEval. revision: yes
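
The correlation analysis discussed in response 2 could be computed along the following lines. This is a minimal sketch of the Pearson coefficient in plain Python. The function name `pearson_r` is hypothetical, and the numbers in the usage lines are synthetic placeholders rather than scores from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder data, NOT results from the paper.
print(pearson_r([1, 2, 3], [2, 4, 6]))        # perfectly correlated
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))  # uncorrelated
```

In practice one list would hold each model's HumanEval pass@1 and the other its CRUXEval score. A coefficient well below 1 would support the claim that the two benchmarks measure different capabilities.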

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements only

full rationale

The paper introduces CRUXEval as a new benchmark of 800 short Python functions and reports direct pass@1 measurements for input/output prediction on existing models (GPT-4+CoT, Code Llama, etc.). No derivation chain, fitted parameters, equations, or predictions exist; the generation recipe is presented as a construction method for the test set itself rather than a claim that reduces to its own outputs. No self-citations are invoked to justify uniqueness or forbid alternatives. All reported numbers are external evaluations against the newly created functions, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions about deterministic Python execution semantics for the generated test cases and conventional pass@k evaluation protocols; no free parameters or invented entities are introduced.
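
For readers unfamiliar with pass@k, the sketch below shows the commonly used unbiased estimator: from n samples per problem, of which c are correct, estimate the chance that at least one of k drawn samples is correct. It is assumed here that the paper follows this convention; the sampling settings are not stated on this page, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of correct samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=0, k=5))  # 0.0: no correct sample at all
```

A benchmark-level score is then the mean of this quantity over all problems.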

axioms (1)
  • domain assumption: The generated Python functions execute deterministically with the provided inputs to produce the stated outputs without side effects or undefined behavior.
    Required for the input-output pairs to serve as valid ground truth for the prediction tasks.

pith-pipeline@v0.9.0 · 5538 in / 1251 out tokens · 74293 ms · 2026-05-14T20:53:05.420039+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  2. Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

    cs.AI 2026-05 unverdicted novelty 7.0

    Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.

  3. Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

    cs.SE 2026-04 conditional novelty 7.0

    Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

  4. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  5. Evaluating LLMs Code Reasoning Under Real-World Context

    cs.SE 2026-04 unverdicted novelty 7.0

    R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

  6. An Iterative Test-and-Repair Framework for Competitive Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

  7. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  8. Teaching LLMs Program Semantics via Symbolic Execution Traces

    cs.SE 2026-05 unverdicted novelty 6.0

    Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.

  9. Hypothesis generation and updating in large language models

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

  10. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  11. CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

    cs.SE 2026-04 unverdicted novelty 6.0

    CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.

  12. PrismaDV: Automated Task-Aware Data Unit Test Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...

  13. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  14. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  15. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  16. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  17. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  18. Qwen3 Technical Report

    cs.CL 2025-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  19. Qwen2.5-Coder Technical Report

    cs.CL 2024-09 unverdicted novelty 4.0

    Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

  20. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
