AutoCodeRover: Autonomous program improvement

15 Pith papers cite this work.

2026 · 15 representative citing papers
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achieving top accuracy on most benchmarks and 6.69x speedup over baselines.
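The prefill-signal idea lends itself to a small sketch: per-token negative log-likelihood computed from a logits matrix, which is the raw quantity a prefill pass exposes and MASPrism ranks on. The vocabulary size, scores, and observed tokens below are invented for illustration.

```python
import math

def per_token_nll(logits, token_ids):
    """Negative log-likelihood of each observed token under its
    row of logits, as a single prefill pass would expose them."""
    nlls = []
    for row, tok in zip(logits, token_ids):
        # log-softmax via the log-sum-exp trick for numerical stability
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        nlls.append(lse - row[tok])
    return nlls

# Toy example: 4-token vocabulary, 3 observed tokens.
logits = [
    [2.0, 0.1, 0.1, 0.1],   # model is confident in token 0
    [0.5, 0.5, 0.5, 0.5],   # model is uncertain
    [0.1, 0.1, 3.0, 0.1],   # model is confident in token 2
]
observed = [0, 3, 2]
scores = per_token_nll(logits, observed)
# A failure-attribution pass would flag the highest-NLL spans.
worst = max(range(len(scores)), key=lambda i: scores[i])
```

The uncertain middle step gets the highest NLL, which is the kind of candidate a second ranking pass would then re-score.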
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
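The paper's DSL is not public, so the following is a minimal invented sketch of what a typed harness description covering roles, prompts, tools, and topology could look like; every class name and field here is an assumption, not AgentFlow's actual syntax.

```python
from dataclasses import dataclass, field

# Hypothetical miniature of a typed harness description.
@dataclass(frozen=True)
class Role:
    name: str
    prompt: str
    tools: tuple[str, ...] = ()

@dataclass
class Harness:
    roles: dict[str, Role] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # topology
    protocol: str = "round-robin"

    def add_role(self, role: Role) -> None:
        self.roles[role.name] = role

    def connect(self, src: str, dst: str) -> None:
        # Typed in spirit: edges may only join declared roles.
        if src not in self.roles or dst not in self.roles:
            raise KeyError("both endpoints must be declared roles")
        self.edges.append((src, dst))

h = Harness()
h.add_role(Role("triager", "Find crash candidates", ("fuzzer",)))
h.add_role(Role("exploiter", "Confirm exploitability", ("debugger",)))
h.connect("triager", "exploiter")
```

A runtime-signal feedback loop, as the summary describes, would mutate objects like `h` and re-evaluate the harness, rather than hand-editing prompts.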
-
Certified Program Synthesis with a Multi-Modal Verifier
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior references.
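The specification-validation step can be sketched outside Lean: before attempting any proof, check a candidate spec against a trusted reference implementation on random inputs, so a wrong spec is caught cheaply. The sorting spec and harness below are invented stand-ins, written in Python rather than Lean.

```python
import random

def spec_holds(xs, out):
    """Candidate specification for sorting: output is ordered and
    is a permutation of the input."""
    return out == sorted(out) and sorted(xs) == sorted(out)

def validate_spec(spec, reference, trials=200, seed=0):
    """Randomized testing of a spec against a trusted reference."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 10))]
        if not spec(xs, reference(xs)):
            # The reference is trusted, so this means the spec is wrong.
            return False, xs
    return True, None

ok, cex = validate_spec(spec_holds, sorted)
```

A spec that wrongly demands strictly increasing output, for example, would be rejected as soon as a random input contains a duplicate, before any proof effort is spent.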
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
-
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
A new benchmark for 0-to-1 CLI tool generation shows that state-of-the-art LLMs achieve a success rate under 43% under black-box equivalence testing against real oracles.
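Black-box equivalence testing can be sketched with plain functions standing in for the generated tool and the oracle; a real harness would invoke both binaries via subprocess. The word-count example and its seeded bug are invented for illustration.

```python
import random

def oracle_wc(text: str) -> int:
    """Trusted oracle: count whitespace-separated words."""
    return len(text.split())

def generated_wc(text: str) -> int:
    """A plausible generated variant with a bug: it counts
    separators, so runs of spaces inflate the count."""
    return text.count(" ") + 1 if text else 0

def equivalent(candidate, oracle, trials=300, seed=1):
    """Compare candidate and oracle on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        words = [rng.choice(["a", "bb", ""]) for _ in range(rng.randint(0, 6))]
        text = " ".join(words)
        if candidate(text) != oracle(text):
            return False, text   # behavioral divergence found
    return True, None

ok, counterexample = equivalent(generated_wc, oracle_wc)
```

The divergence surfaces only on inputs with consecutive spaces, which is exactly the kind of edge case that equivalence testing against a real oracle catches and unit-test-free evaluation misses.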
-
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
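AgentSZZ builds on the classic SZZ heuristic, which is simple enough to sketch: blame each line deleted by a bug-fixing commit to the commit that last modified it, and treat those commits as bug-inducing candidates. The toy blame table below is invented; note that this single-file baseline is exactly what misses the cross-file and ghost commits the paper targets.

```python
def szz_candidates(fix_deleted_lines, blame):
    """Classic SZZ: blame maps (file, line_no) -> commit that last
    modified that line; deleted lines name the candidates."""
    candidates = set()
    for file, line_no in fix_deleted_lines:
        commit = blame.get((file, line_no))
        if commit is not None:
            candidates.add(commit)
    return candidates

# Toy stand-in for `git blame` output.
blame = {
    ("util.py", 10): "c1",
    ("util.py", 11): "c3",
    ("main.py", 42): "c2",
}
# The bug fix deletes two lines in util.py:
inducing = szz_candidates([("util.py", 10), ("util.py", 11)], blame)
```

A bug introduced by an edit in a *different* file than the fix touches never appears in `fix_deleted_lines`, which is why adaptive exploration beyond the blame table helps.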
-
Agentic Coding Needs Proactivity, Not Just Autonomy
Coding agents need a three-level proactivity taxonomy (Reactive, Scheduled, Situation-Aware), with insight policies evaluated using Insight Decision Quality, Context Grounding Score, and Learning Lift.
-
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
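The signal-to-noise intuition can be illustrated without training a sparse autoencoder: when the vulnerability signal is concentrated in a few dimensions, a top-k sparsification suppresses diffuse noise. The activation vector, signal dimensions, and k below are invented; SAGE learns the sparse code rather than hard-coding it.

```python
import math

def top_k_sparse(v, k):
    """Keep only the k largest-magnitude components, zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def snr(v, signal_dims):
    """Ratio of signal energy to noise energy across dimensions."""
    sig = math.sqrt(sum(v[i] ** 2 for i in signal_dims))
    noise = math.sqrt(sum(v[i] ** 2 for i in range(len(v))
                          if i not in signal_dims))
    return sig / noise if noise else float("inf")

# Two strong "vulnerability" dimensions buried in mild noise:
act = [0.2, 3.0, 0.1, 0.2, 2.5, 0.15, 0.1, 0.2]
signal_dims = {1, 4}
before = snr(act, signal_dims)
after = snr(top_k_sparse(act, 2), signal_dims)  # noise dims zeroed out
```

In this toy case sparsification removes the noise entirely; in a trained autoencoder the gain is partial, which is what the paper's 12.7x internal SNR figure quantifies.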
-
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
-
Enhancing Program Repair with Specification Guidance and Intermediate Behavioral Signals
SpecTune improves LLM-based automated program repair by deriving localized postconditions at execution checkpoints and using alpha and beta signals to produce precise fault-localization and patch-generation guidance.
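Deriving localized postconditions at execution checkpoints resembles Daikon-style invariant inference, sketched here: record variable values each time a checkpoint is hit across passing runs, then emit simple range invariants a candidate patch must not violate. The checkpoint API and invariant form are assumptions for illustration, not the paper's.

```python
from collections import defaultdict

observations = defaultdict(list)

def checkpoint(name, **values):
    """Record variable values each time this checkpoint is reached."""
    for var, value in values.items():
        observations[name].append((var, value))

def abs_impl(x):
    result = x if x >= 0 else -x
    checkpoint("after_branch", result=result)
    return result

def derive_postconditions(name):
    """One derived invariant per variable: an inclusive value range."""
    by_var = defaultdict(list)
    for var, value in observations[name]:
        by_var[var].append(value)
    return {var: (min(vals), max(vals)) for var, vals in by_var.items()}

for x in [-3, 0, 7, -1]:
    abs_impl(x)
post = derive_postconditions("after_branch")
# post["result"] == (0, 7): a patch making result negative violates it.
```

Such localized ranges give fault localization a checkpoint-level signal (which checkpoint first violates its postcondition) and patch generation a constraint to respect.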
-
AgentReputation: A Decentralized Agentic AI Reputation Framework
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
-
From Helpful to Trustworthy: LLM Agents for Pair Programming
A research proposal for three studies on multi-agent LLM pair programming that externalizes intent and uses automated validation to increase trustworthiness.
-
Building an Internal Coding Agent at Zup: Lessons and Open Questions
Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.