hub

Test Intention Guided LLM-Based Unit Test Generation

Vulnerability Detection with Code Language Models: How Far are We? · 2025 · arXiv 5347.2025

47 Pith papers cite this work. Polarity classification is still indexing.

47 Pith papers citing it

read on arXiv browse 47 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

co-cited works

representative citing papers

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

cs.CR · 2026-05-08 · conditional · novelty 8.0

Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.

Demystifying the Silence of Correctness Bugs in PyTorch Compiler

cs.SE · 2026-04-09 · conditional · novelty 8.0

First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

cs.SE · 2026-05-11 · accept · novelty 7.0

CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.

Generating Complex Code Analyzers from Natural Language Questions

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.

A Learning Method for Symbolic Systems Using Large Language Models

cs.SE · 2026-05-09 · unverdicted · novelty 7.0

LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0

MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achieving top accuracy on most benchmarks and 6.69x speedup over baselines.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 4 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

cs.CR · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

VulKey reaches 31.5% repair accuracy on real C/C++ vulnerabilities by matching hierarchical expert patterns to guide LLM patch generation, beating prior baselines by 7.6%.

CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.

Certified Program Synthesis with a Multi-Modal Verifier

cs.SE · 2026-04-17 · unverdicted · novelty 7.0 · 2 refs

LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior参考.

The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE

cs.SE · 2026-04-16 · unverdicted · novelty 7.0

Software engineering scope expands beyond executable code to semi-executable artifacts best diagnosed by the new six-ring Semi-Executable Stack model.

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

cs.SE · 2026-04-16 · unverdicted · novelty 7.0

Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

cs.SE · 2026-04-14 · accept · novelty 7.0

CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.

Measuring LLM Trust Allocation Across Conflicting Software Artifacts

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

TRACE reveals that LLMs detect documentation bugs and contradictions better than subtle implementation drift, with asymmetric sensitivity and poor confidence calibration across seven models on 22k traces.

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

cs.SE · 2026-05-13 · accept · novelty 6.0

Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.

Generalizing Test Cases for Comprehensive Test Scenario Coverage

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

TestGeneralizer generalizes an initial test into a set of executable tests covering more diverse scenarios, delivering +31.66% mutation-based and +23.08% LLM-assessed scenario coverage gains over ChatTester on 12 open-source Java projects.

Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

cs.SE · 2026-04-23 · unverdicted · novelty 6.0 · 2 refs

IRAP quantifies ambiguous performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation and outperforms ten prior methods on four real-world datasets with up to 40x gains in five interaction rounds.

citing papers explorer

Showing 47 of 47 citing papers.

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions cs.CR · 2026-05-08 · conditional · none · ref 12
Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.
Demystifying the Silence of Correctness Bugs in PyTorch Compiler cs.SE · 2026-04-09 · conditional · none · ref 35
First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.
Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach cs.AI · 2026-05-13 · unverdicted · none · ref 32
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 60
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits cs.SE · 2026-05-11 · accept · none · ref 3
CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.
ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing cs.SE · 2026-05-10 · unverdicted · none · ref 41
ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.
Generating Complex Code Analyzers from Natural Language Questions cs.SE · 2026-05-10 · unverdicted · none · ref 34
Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.
A Learning Method for Symbolic Systems Using Large Language Models cs.SE · 2026-05-09 · unverdicted · none · ref 12
LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals cs.SE · 2026-05-08 · unverdicted · none · ref 4
MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achieving top accuracy on most benchmarks and 6.69x speedup over baselines.
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair cs.SE · 2026-05-07 · unverdicted · none · ref 6 · 4 links
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns cs.CR · 2026-05-03 · unverdicted · none · ref 14 · 2 links
VulKey reaches 31.5% repair accuracy on real C/C++ vulnerabilities by matching hierarchical expert patterns to guide LLM patch generation, beating prior baselines by 7.6%.
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation cs.SE · 2026-04-21 · unverdicted · none · ref 6
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
Certified Program Synthesis with a Multi-Modal Verifier cs.SE · 2026-04-17 · unverdicted · none · ref 9 · 2 links
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior参考.
The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE cs.SE · 2026-04-16 · unverdicted · none · ref 48
Software engineering scope expands beyond executable code to semi-executable artifacts best diagnosed by the new six-ring Semi-Executable Stack model.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap cs.SE · 2026-04-16 · unverdicted · none · ref 3
Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.
Evaluating LLMs Code Reasoning Under Real-World Context cs.SE · 2026-04-14 · unverdicted · none · ref 3
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation cs.SE · 2026-04-14 · accept · none · ref 25
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
Evaluating LLM Agents on Automated Software Analysis Tasks cs.SE · 2026-04-13 · unverdicted · none · ref 5
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories cs.SE · 2026-04-08 · unverdicted · none · ref 61
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems cs.SE · 2026-04-07 · unverdicted · none · ref 53
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
Measuring LLM Trust Allocation Across Conflicting Software Artifacts cs.SE · 2026-04-03 · unverdicted · none · ref 9
TRACE reveals that LLMs detect documentation bugs and contradictions better than subtle implementation drift, with asymmetric sensitivity and poor confidence calibration across seven models on 22k traces.
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study cs.SE · 2026-05-13 · accept · none · ref 11
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
Generalizing Test Cases for Comprehensive Test Scenario Coverage cs.SE · 2026-04-23 · unverdicted · none · ref 10
TestGeneralizer generalizes an initial test into a set of executable tests covering more diverse scenarios, delivering +31.66% mutation-based and +23.08% LLM-assessed scenario coverage gains over ChatTester on 12 open-source Java projects.
Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation cs.SE · 2026-04-23 · unverdicted · none · ref 7 · 2 links
IRAP quantifies ambiguous performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation and outperforms ten prior methods on four real-world datasets with up to 40x gains in five interaction rounds.
Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach cs.SE · 2026-04-22 · unverdicted · none · ref 10
Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection cs.CR · 2026-04-21 · unverdicted · none · ref 16
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization cs.AI · 2026-04-19 · unverdicted · none · ref 66
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
SpecPylot: Python Specification Generation using Large Language Models cs.SE · 2026-04-17 · unverdicted · none · ref 12
SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.
Debugging Performance Issues in WebAssembly Runtimes via Mutation-based Inference cs.SE · 2026-04-15 · unverdicted · none · ref 62
WarpL uses mutation to find and isolate suboptimal instruction sequences causing performance issues in WebAssembly runtimes by comparing machine code of original and non-problematic mutant programs.
Enhancing Program Repair with Specification Guidance and Intermediate Behavioral Signals cs.SE · 2026-04-13 · unverdicted · none · ref 3
SpecTune improves LLM-based automated program repair by deriving localized postconditions at execution checkpoints and using alpha and beta signals to produce precise fault-localization and patch-generation guidance.
Towards Improving the External Validity of Software Engineering Experiments with Transportability Methods cs.SE · 2026-04-09 · unverdicted · none · ref 18 · 2 links
Transportability methods can transport causal effects from experimental samples to broader target populations in software engineering by leveraging observational data to improve external validity.
Content Fuzzing for Escaping Information Cocoons on Digital Social Media cs.CL · 2026-04-07 · unverdicted · none · ref 71
ContentFuzz rewrites posts with LLM guidance from stance model confidence to flip machine labels without altering human intent, tested across four models and three datasets in two languages.
PROMISE: Proof Automation as Structural Imitation of Human Reasoning cs.LO · 2026-04-07 · unverdicted · none · ref 40
PROMISE reframes automated proof generation as stateful search over structural embeddings of proof states, outperforming prior LLM-based systems by up to 26 points on the seL4 benchmark.
Beyond Crash-to-Patch: Patch Evolution for Linux Kernel Repair cs.SE · 2026-04-04 · unverdicted · none · ref 12
Reconstructing 6946 syzbot bug-fix lifecycles reveals that accepted kernel patches are non-local and reviewer-constrained, enabling PatchAdvisor to improve automated repair quality over baselines via retrieval and diagnostic guidance.
Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs cs.SE · 2026-04-04 · unverdicted · none · ref 14
A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents cs.CL · 2026-05-08 · unverdicted · none · ref 64
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines cs.SE · 2026-05-08 · unverdicted · none · ref 6
The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.
Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection cs.CR · 2026-04-23 · unverdicted · none · ref 4
A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet suite.
Systematic Detection of Energy Regression and Corresponding Code Patterns in Java Projects cs.SE · 2026-04-21 · unverdicted · none · ref 47
EnergyTrackr detects statistically significant energy regressions in Java commits from 3,232 changes across three projects and identifies recurring code anti-patterns such as missing early exits.
Relationships Between Trust, Compliance, and Performance for Novice Programmers Using AI Code Generation cs.HC · 2026-04-21 · unverdicted · none · ref 52
Among novice programmers using AI code generators, trust did not predict compliance with suggestions, while performance correlated with both compliance and increased subsequent trust.
MASFuzzer: Fuzz Driver Generation and Adaptive Scheduling via Multidimensional API Sequences cs.SE · 2026-04-20 · unverdicted · none · ref 42
MASFuzzer generates fuzz drivers via mined multidimensional API sequences and adaptive scheduling, delivering 8.54% higher code coverage and 16 new vulnerabilities across 12 libraries.
Investigating Code Reuse in Software Redesign: A Case Study cs.SE · 2026-04-09 · unverdicted · none · ref 67
A case study of software redesign reveals reuse challenges and shows that semantic alignment heuristics with hierarchical clone detection reduce irrelevant clones by 33-99% and raise precision to 86%.
Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis cs.SE · 2026-04-07 · unverdicted · none · ref 7 · 2 links
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study cs.SE · 2026-04-03 · conditional · none · ref 4
AI models generated nearly 16,000 lines of unit tests in hours and enabled safe large-scale refactoring with up to 78% branch coverage in a case study.
Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers cs.CR · 2026-05-08 · unverdicted · none · ref 65
An empirical study of security DSLs and code analyzers finds few common concepts, overly general weakness descriptions, and that even experts are overwhelmed by the complexity of potential mappings.
Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning cs.SE · 2026-04-20 · unverdicted · none · ref 59 · 2 links
MNAL reduces human effort in bug report labeling by up to 95.8% for readability and 196% for identifiability while improving identification performance and working with various neural models.
From Helpful to Trustworthy: LLM Agents for Pair Programming cs.SE · 2026-04-11 · unverdicted · none · ref 9 · 3 links
A research proposal for three studies on multi-agent LLM pair programming that externalizes intent and uses automated validation to increase trustworthiness.

Test Intention Guided LLM-Based Unit Test Generation

hub tools

co-cited works

fields

years

verdicts

representative citing papers

citing papers explorer