MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.
Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Reasoning-tuned LLMs align with human comprehension failure patterns under code obfuscation using the Block Model, unlike instruction-tuned variants.
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
Cross-lingual RACG shows non-trivial but unequal knowledge transfer across 13 programming languages, depending on linguistic affinity and pretraining diversity, with limited reliance on natural language information when using code-specific retrievers.
citing papers explorer
-
MirrorCode: AI can rebuild entire programs from behavior alone
MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Do Machines Struggle Where Humans Do? LLM and Human Comprehension of Obfuscated Code
Reasoning-tuned LLMs align with human comprehension failure patterns under code obfuscation using the Block Model, unlike instruction-tuned variants.
-
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
-
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation
Cross-lingual RACG shows non-trivial but unequal knowledge transfer across 13 programming languages, depending on linguistic affinity and pretraining diversity, with limited reliance on natural language information when using code-specific retrievers.