Test Intention Guided LLM-Based Unit Test Generation
48 Pith papers cite this work. Polarity classification is still indexing.
48 representative citing papers (2026)
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
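The exact ADD-based computation is not reproduced here, but the quantity being certified can be illustrated with a naive Monte-Carlo baseline: estimate the probability that flipping one feature changes the ensemble's vote, with a Hoeffding confidence bound. All names below are illustrative, and this sampling approach is exactly the kind of method the paper's exact algorithm is meant to beat.

```python
import math
import random

def ensemble_predict(trees, x):
    """Majority vote over a list of decision functions."""
    votes = sum(t(x) for t in trees)
    return 1 if votes * 2 >= len(trees) else 0

def sensitivity_estimate(trees, sample_input, feature, n=10_000, delta=0.05):
    """Monte-Carlo estimate of P[flipping `feature` changes the prediction],
    with a two-sided Hoeffding bound holding with probability 1 - delta.
    A naive baseline, not the paper's exact ADD algorithm."""
    flips = 0
    for _ in range(n):
        x = sample_input()
        y = ensemble_predict(trees, x)
        x2 = dict(x)
        x2[feature] = 1 - x2[feature]   # flip one binary feature
        if ensemble_predict(trees, x2) != y:
            flips += 1
    eps = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding half-width
    p = flips / n
    return p, (max(0.0, p - eps), min(1.0, p + eps))

# Toy ensemble of three stump-like trees over binary features f0..f2.
trees = [
    lambda x: x["f0"],
    lambda x: x["f0"] & x["f1"],
    lambda x: x["f2"],
]
sample = lambda: {f"f{i}": random.randint(0, 1) for i in range(3)}
print(sensitivity_estimate(trees, sample, "f0"))
```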
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.
ConCovUp grounds LLM test generation in static analysis and uses backward tracing to produce concurrent test drivers, raising average shared-memory access-pair coverage from 36.6% to 68.1% on nine real-world libraries.
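A minimal sketch of what access-pair coverage plausibly measures, under one simplified reading (the paper may define pairs differently): the fraction of statically possible cross-thread access pairs on a shared location, at least one of them a write, that a concrete execution trace actually exhibits.

```python
from itertools import combinations

# A trace event: (thread_id, op, location), op in {"R", "W"}.
trace = [
    (1, "W", "counter"), (2, "R", "counter"),
    (1, "W", "buf"),     (2, "W", "buf"),
    (1, "R", "flag"),
]

def observed_pairs(events):
    """Unordered cross-thread pairs touching the same location,
    at least one of which is a write (i.e., potential races)."""
    pairs = set()
    for (t1, op1, loc1), (t2, op2, loc2) in combinations(events, 2):
        if t1 != t2 and loc1 == loc2 and "W" in (op1, op2):
            pairs.add((loc1, frozenset([(t1, op1), (t2, op2)])))
    return pairs

# The statically possible pairs would come from the static analysis;
# here a toy universe size is hard-coded for illustration.
static_universe = 6
coverage = len(observed_pairs(trace)) / static_universe
print(f"access-pair coverage: {coverage:.1%}")
```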
Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.
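One way to picture the described loop is below; `llm`, `retrieve_examples`, and `run_codeql` are hypothetical stubs standing in for Merlin's actual components, and the assistive-query check is a plausible reading of the self-test idea rather than the paper's exact procedure.

```python
# Sketch of a RAG-plus-self-test refinement loop in the spirit of Merlin.

def llm(prompt: str) -> str:
    return "import java\nfrom Method m\nselect m"      # stand-in LLM

def retrieve_examples(question: str) -> list[str]:
    return ["/* similar documented query */"]           # stand-in retriever

def run_codeql(query: str) -> list[str]:
    return ["src/Foo.java:42"]                          # stand-in evaluator

def generate_query(question: str, max_rounds: int = 3) -> str:
    examples = retrieve_examples(question)
    query = llm(f"Examples:\n{examples}\n\nWrite a CodeQL query: {question}")
    for _ in range(max_rounds):
        # Self-test: a broader "assistive" query over-approximates the
        # target; candidate results the assistive query does not return
        # are likely spurious, so the candidate gets refined.
        assistive = llm(f"Write a strictly broader CodeQL query: {question}")
        extra = set(run_codeql(query)) - set(run_codeql(assistive))
        if not extra:
            return query
        query = llm(f"Refine; these results look spurious: {sorted(extra)}\n{query}")
    return query

print(generate_query("Find methods that call deprecated APIs"))
```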
LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.
MASPrism attributes failures in multi-agent LLM executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achieving top accuracy on most benchmarks and a 6.69x speedup over baselines.
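The token-level signal can be read off a single forward (prefill) pass of any causal LM; a minimal Hugging Face transformers sketch, using gpt2 as a stand-in for the paper's small model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM works; stand-in for the paper's model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_nll(text: str) -> list[tuple[str, float]]:
    """Per-token negative log-likelihood from one prefill (forward) pass."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, T, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nll = -logprobs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), nll.tolist()))

# Tokens with unusually high NLL are candidate failure points in an
# agent transcript, which MASPrism then re-ranks with a second prefill.
for token, score in token_nll("Agent B wrote the file to the wrong path."):
    print(f"{token:>12s}  {score:.2f}")
```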
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
VulKey reaches 31.5% repair accuracy on real C/C++ vulnerabilities by matching hierarchical expert patterns to guide LLM patch generation, beating prior baselines by 7.6%.
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
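A minimal sketch of that confirmation rule, with illustrative names rather than CASCADE's actual interfaces: a failing doc-derived test only counts as a mismatch if an implementation generated from the same documentation passes it, which filters out bad tests.

```python
def runs_green(impl, test) -> bool:
    try:
        test(impl)
        return True
    except AssertionError:
        return False

def confirmed_mismatch(real_impl, doc_impl, doc_test) -> bool:
    # Report only when the real code fails a test that the
    # documentation-derived code passes.
    return (not runs_green(real_impl, doc_test)) and runs_green(doc_impl, doc_test)

# Docs say clamp(x, lo, hi) bounds x to [lo, hi]; the real code forgot `hi`.
real_clamp = lambda x, lo, hi: max(x, lo)
doc_clamp = lambda x, lo, hi: min(max(x, lo), hi)

def doc_test(clamp):
    assert clamp(9, 0, 5) == 5

print(confirmed_mismatch(real_clamp, doc_clamp, doc_test))  # True -> report bug
```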
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-mode verifier in Lean that validates specifications via randomized testing and delegates proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior reference solutions.
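The randomized-testing half of that pipeline can be sketched independently of Lean: check a candidate specification against a trusted oracle on random inputs before spending any proof effort. Names below are illustrative, not LeetProof's API.

```python
import random

def reference(xs):                  # trusted oracle (e.g., problem examples)
    return sorted(xs)

def candidate_spec(inp, out) -> bool:
    # Proposed spec: output is a permutation of the input and is ordered.
    return sorted(inp) == sorted(out) and all(
        a <= b for a, b in zip(out, out[1:])
    )

def spec_survives_random_testing(trials=1000) -> bool:
    for _ in range(trials):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 8))]
        if not candidate_spec(xs, reference(xs)):
            return False            # spec contradicts the oracle: reject it
    return True

print(spec_survives_random_testing())
```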
Software engineering's scope expands beyond executable code to semi-executable artifacts, which are best diagnosed with the new six-ring Semi-Executable Stack model.
Atropos uses a graph convolutional network (GCN) over inference graphs to predict failures early and hot-swaps to larger LLMs, achieving 74% of large-model performance at 24% of the cost.
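The hot-swap policy reduces to a thresholded router; in this sketch `predict_failure_prob` stands in for the paper's GCN and both model clients are hypothetical stubs.

```python
import random

def predict_failure_prob(inference_graph) -> float:
    return random.random()          # stand-in for the GCN's output

def small_llm(task): return f"small-model answer to {task!r}"
def large_llm(task): return f"large-model answer to {task!r}"

def solve(task, inference_graph, threshold=0.7):
    # Escalate early, before the small model finishes, when the
    # predicted probability of failure crosses the threshold.
    if predict_failure_prob(inference_graph) > threshold:
        return large_llm(task)
    return small_llm(task)

print(solve("summarize the build log", inference_graph=None))
```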
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
TRACE reveals that LLMs detect documentation bugs and contradictions better than subtle implementation drift, with asymmetric sensitivity and poor confidence calibration across seven models on 22k traces.
Developers using AI assistants exhibit more stable emotions and greater focus on code creation, evaluation, and verification, captured in a new four-dimensional S-IASE model from retrospective labeling of screen recordings, surveys, and interviews.
Code language models show no transferable security understanding from code diffs alone: they rely on commit messages, miss over 93% of fixes at a 0.5% false-positive rate, and suffer large drops under group or temporal splits.
TestGeneralizer generalizes an initial test into a set of executable tests covering more diverse scenarios, delivering +31.66% mutation-based and +23.08% LLM-assessed scenario coverage gains over ChatTester on 12 open-source Java projects.
citing papers explorer
- Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions
Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.
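A deliberately simplified check in the paper's spirit (not TaintAWI itself): flag workflow steps that interpolate attacker-controllable event text into a prompt-like input. The field list and matching below are illustrative; requires PyYAML.

```python
import re
import yaml

TAINTED = re.compile(
    r"\$\{\{\s*github\.event\.(issue|pull_request|comment|review)"
    r"\.(title|body)\s*\}\}"
)

def flag_tainted_steps(workflow_yaml: str):
    """Single-file heuristic: report (job, step index, input key) triples
    where an untrusted event field flows into a step input."""
    wf = yaml.safe_load(workflow_yaml)
    hits = []
    for job_name, job in (wf.get("jobs") or {}).items():
        for i, step in enumerate(job.get("steps") or []):
            for key, value in (step.get("with") or {}).items():
                if isinstance(value, str) and TAINTED.search(value):
                    hits.append((job_name, i, key))
    return hits

example = """
jobs:
  triage:
    steps:
      - uses: some-org/llm-triage-action@v1
        with:
          prompt: "Summarize this issue: ${{ github.event.issue.body }}"
"""
print(flag_tainted_steps(example))   # [('triage', 0, 'prompt')]
```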
- Demystifying the Silence of Correctness Bugs in PyTorch Compiler
First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.
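AlignGuard's LLM-guided mutation is not reproduced here, but its differential oracle, comparing eager and compiled results on mutated tests, can be sketched with stock PyTorch 2.x APIs; the shape perturbation stands in for the paper's smarter mutations.

```python
import torch

def seed_fn(x):
    return torch.nn.functional.gelu(x) * x.sum(dim=-1, keepdim=True)

compiled_fn = torch.compile(seed_fn)

for shape in [(2, 3), (1, 7), (4, 1)]:      # shape "mutations" of the seed
    torch.manual_seed(0)
    x = torch.randn(shape)
    eager, comp = seed_fn(x), compiled_fn(x)
    # A silent correctness bug shows up as a numeric divergence between
    # the eager and compiled executions of the same test.
    if not torch.allclose(eager, comp, rtol=1e-4, atol=1e-5):
        print(f"silent wrong-result candidate at shape {shape}")
    else:
        print(f"shape {shape}: eager and compiled agree")
```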
- AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study
AI models generated nearly 16,000 lines of unit tests in hours and enabled safe large-scale refactoring with up to 78% branch coverage in a case study.