StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Ldb: A large language model debugger via verifying runtime execution step-by-step
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
citing papers explorer
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
Dynamic analysis enhances issue resolution
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
-
Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
-
In Line with Context: Repository-Level Code Generation via Context Inlining
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
-
SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models
SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
-
Prompt Optimization for LLM Code Generation via Reinforcement Learning
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
-
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
-
Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?
NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.