CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

2024 · arXiv 2408.13001

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MirrorCode: AI can rebuild entire programs from behavior alone

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

Do Machines Struggle Where Humans Do? LLM and Human Comprehension of Obfuscated Code

cs.SE · 2026-06-30 · unverdicted · novelty 6.0

Reasoning-tuned LLMs align with human comprehension failure patterns under code obfuscation using the Block Model, unlike instruction-tuned variants.

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation

cs.SE · 2025-06-04 · accept · novelty 6.0

Cross-lingual RACG shows non-trivial but unequal knowledge transfer across 13 programming languages, depending on linguistic affinity and pretraining diversity, with limited reliance on natural language information when using code-specific retrievers.

citing papers explorer

MirrorCode: AI can rebuild entire programs from behavior alone · cs.AI · 2026-06-29 · unverdicted · none · ref 20
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning · cs.SE · 2026-05-12 · unverdicted · none · ref 33
Do Machines Struggle Where Humans Do? LLM and Human Comprehension of Obfuscated Code · cs.SE · 2026-06-30 · unverdicted · none · ref 8
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning · cs.LG · 2026-05-18 · unverdicted · none · ref 41
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation · cs.SE · 2025-06-04 · accept · none · ref 57
