StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
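The "print anchors" idea, instrumenting code so each step emits a verifiable state line that a model's reasoning can be checked against, can be illustrated with a minimal sketch. The function, anchor format, and helper names below are illustrative, not from the paper:

```python
import io
from contextlib import redirect_stdout

def instrumented_sum(values):
    """Toy function with print anchors: each loop step emits its state."""
    total = 0
    for i, v in enumerate(values):
        total += v
        # Print anchor: a verifiable record of the state at this step.
        print(f"ANCHOR step={i} v={v} total={total}")
    return total

def capture_trace(fn, *args):
    """Run fn and return (result, anchor lines) for stepwise verification."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        result = fn(*args)
    anchors = [l for l in buf.getvalue().splitlines() if l.startswith("ANCHOR")]
    return result, anchors

result, trace = capture_trace(instrumented_sum, [3, 4, 5])
# A model's predicted stepwise trace can now be compared line by line
# against the ground-truth execution trace.
```

The anchors turn an opaque execution into a checkable sequence, which is the kind of verifiable signal a reinforcement-learning reward can be built on.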
arXiv preprint arXiv:2402.16906
5 Pith papers cite this work. Polarity classification is still indexing.
Citation summary: years: 2026 (5 citing papers); roles: background (1).
Citing papers
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
  AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
- Dynamic analysis enhances issue resolution
  DAIRA integrates dynamic tracing into LLM agents, achieving a 79.4% resolution rate on SWE-bench Verified for code defect repair.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
- How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
  Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
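The adversarial idea in the AdverMCTS entry above can be sketched without the tree search: an attacker role searches for inputs on which candidate code disagrees with an oracle, and "pseudo-correct" code only survives if no counterexample is found. All names and the brute-force attacker here are illustrative, not from the paper:

```python
def reference_abs(x):
    """Oracle for the intended behavior."""
    return x if x >= 0 else -x

def candidate_abs(x):
    """Pseudo-correct solver output: passes a weak static test (x = 3)
    but is wrong on every negative input."""
    return x

def attack(candidate, oracle, trial_inputs):
    """Attacker role: search inputs and return the first counterexample,
    or None if the candidate survives this attack budget."""
    for x in trial_inputs:
        if candidate(x) != oracle(x):
            return x  # flaw exposed
    return None

counterexample = attack(candidate_abs, reference_abs, range(-5, 6))
# counterexample is -5: the candidate passed a static test but fails
# the first adversarially chosen negative input.
```

In the paper's framing the attacker is a learned test generator inside MCTS rather than an enumerator, but the minimax structure is the same: the solver's code is only accepted when the attacker can no longer break it.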
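The dynamic tracing used by DAIRA-style agents can be approximated in Python with `sys.settrace`, which records which lines actually execute; the buggy function and helper below are illustrative, not from the paper:

```python
import sys

def buggy_mid(a, b):
    mid = a + b // 2  # bug: precedence; intended (a + b) // 2
    return mid

def run_with_trace(fn, *args):
    """Collect (function, line_no) events while fn runs: the kind of
    dynamic signal an agent can feed back into its repair loop."""
    events = []
    def tracer(frame, event, arg):
        if event == "line":
            events.append((frame.f_code.co_name, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)  # always restore tracing
    return result, events

result, events = run_with_trace(buggy_mid, 2, 8)
# result is 6 (2 + 8 // 2) rather than the intended midpoint 5; the
# recorded events pin the divergence to the line computing `mid`.
```

Pairing a wrong output with the exact lines executed is what lets an agent localize a defect instead of guessing from static code alone.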
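The self-repair loop measured in the last entry can be sketched minimally: generate a draft, run tests, and retry with a revised draft until the tests pass or the budget is spent. The `drafts` list below stands in for successive LLM generations; all names are illustrative:

```python
def run_tests(env):
    """Return None on success, or a failure message to feed back."""
    try:
        assert env["add"](2, 2) == 4
        return None
    except AssertionError:
        return "test failed: add(2, 2) != 4"

def self_repair(drafts, max_tries=3):
    """Try successive drafts until tests pass or the budget is spent."""
    for attempt, src in enumerate(drafts[:max_tries], start=1):
        env = {}
        exec(src, env)            # load this draft's definitions
        if run_tests(env) is None:
            return attempt, src   # repaired within budget
    return None, None

drafts = [
    "def add(a, b): return a - b",  # draft 1: buggy first generation
    "def add(a, b): return a + b",  # draft 2: feedback-driven fix
]
attempt, fixed = self_repair(drafts)
# attempt is 2: the bug is repaired on the second try.
```

The benchmark numbers above suggest most of the gain from such a loop arrives in the first few iterations, so small `max_tries` budgets capture most of the benefit.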