In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024)
33 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
polarities: background 4
citing papers explorer
-
Understanding the (In)Security of Vibe-Coded Applications
Empirical study of real-world vibe-coded apps finds recurring vulnerabilities like placeholder logic and secret exposure caused by AI agent limitations such as memory loss and insufficient security knowledge.
-
SWE-Explore: Benchmarking How Coding Agents Explore Repositories
SWE-Explore is a new benchmark evaluating repository exploration by coding agents on 848 issues across 203 repositories, using line-level ground truth from successful agent trajectories and showing agentic methods outperform classical retrieval on coverage and ranking.
-
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
-
Certified Program Synthesis with a Multi-Modal Verifier
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior references.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
-
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
-
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
-
AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
-
AgentBound: Securing Execution Boundaries of AI Agents
AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
-
Code Researcher: Deep Research Agent for Large Systems Code and Commit History
Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.
-
Loc2Repair: A Framework for Evaluating the Impact of File-Level Issue Localization in Repo-Level LLM Repair
Loc2Repair framework evaluation finds that file-level localization boosts LLM repo repair resolved rates by up to 7.7 percentage points on SWE-bench Verified.
-
IntentTester: Intent-Driven Multi-agent Framework for Cross-Library Test Migration
IntentTester migrates tests across libraries using TDL abstraction and multi-agent LLM synthesis, achieving 85% correctness and 74% effectiveness versus 51% and 43% for baselines on nine projects in JSON, HTML, and Time domains.
-
Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents
Exploratory interview study with 17 developers identifies four forms of emergent oversight work for software agents and documents situated challenges and heuristics.
-
SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
SetupX presents an experiential learning framework for LLM agents that reaches 92% pass rate on functionality-correct repository setup by transferring verified fixes across repositories via XPU representations, LIFO Docker snapshots, and Prosecutor-Judge verification.
-
Combined Program Analysis Techniques: A Systematic Mapping Study
A systematic mapping study of 248 papers introduces a taxonomy of synergistic effects, inter-analysis workflows, and mapping functions to catalog patterns in combined program analysis techniques.
-
Agentic Coding Needs Proactivity, Not Just Autonomy
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
-
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
-
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
-
Enhancing Program Repair with Specification Guidance and Intermediate Behavioral Signals
SpecTune improves LLM-based automated program repair by deriving localized postconditions at execution checkpoints and using alpha and beta signals to produce precise fault-localization and patch-generation guidance.
-
Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Empirical study of 3977 agent trajectories finds Python execution errors correlate with lower success rates on GitHub issues, flags challenging errors, and reports three confirmed bugs in the SWE-Bench platform.
-
CoCoMUT: A Tool for Code-Context Mining and Automated Dataset Generation
CoCoMUT is a reusable pipeline that discovers project structure, constructs call graphs, extracts source, reconciles bytecode to source, and emits versioned JSON datasets of method contexts, demonstrated on 20 Java repositories with 97.8% reconciliation and 99% audit accuracy.
-
Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution
icat-agent improves resolution rates on SWE-bench Verified and Pro by 3.6-18.5% over baselines via event-based multi-agent scaffolding and rubric-driven workflow pivoting while using the same models.
-
Automated Repair of Requirements for Cyber-Physical Systems in Simulink Requirements Tables
A framework repairs CPS requirements in Simulink by leveraging system execution data and is evaluated as effective on six real-world case studies covering 12 requirements.
-
AgentReputation: A Decentralized Agentic AI Reputation Framework
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
-
OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine
OpDiffer applies LLMs and static analysis to opcode-level differential testing of EVMs, reporting 26 previously unknown bugs across nine implementations along with coverage gains and an estimate that 7.21% of real contracts could trigger the bugs.
-
ClinQueryAgent: A Conversational Agent for Population Health Management
The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
-
From Determinism to Delegation: AI-Native Software Engineering and the Evolution of the Agentic Engineer
Software engineering is undergoing a paradigm shift to AI-native practices centered on agentic systems rather than traditional code.
-
From Helpful to Trustworthy: LLM Agents for Pair Programming
A research proposal for three studies on multi-agent LLM pair programming that externalizes intent and uses automated validation to increase trustworthiness.
-
Building an Internal Coding Agent at Zup: Lessons and Open Questions
Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.