
2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Abhik Roychoudhury, Haifeng Ruan, Yuntong Zhang · 2025 · arXiv 5347.2025

Canonical reference. 80% of citing Pith papers cite this work as background.

117 Pith papers citing it

Background 80% of classified citations


citation-role summary

background 34 · dataset 2 · method 2 · baseline 1 · other 1

citation-polarity summary

background 32 · support 2 · use dataset 2 · use method 2 · baseline 1 · unclear 1

authors

Abhik Roychoudhury, Haifeng Ruan, Yuntong Zhang


representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

cs.CR · 2026-05-08 · conditional · novelty 8.0

Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.

Demystifying the Silence of Correctness Bugs in PyTorch Compiler

cs.SE · 2026-04-09 · conditional · novelty 8.0

First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

cs.SE · 2025-07-20 · conditional · novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

An Empirical Study of LLM-Generated Specifications for VeriFast

cs.SE · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

LLMs preserve functional behavior in over 91% of generated VeriFast specifications and source code but achieve only 31.4% verification success, with 94% of failures due to separation logic domain knowledge errors.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities; the top model, Claude-3.5, scores 48.7/100, and most models struggle on generation, transformation, and correction.

Natural Language-Focused Software Engineering via Code-Documentation Equivalence

cs.SE · 2026-06-20 · unverdicted · novelty 7.0

Defines documentation-to-code equivalence and introduces Documentary to generate matching docs for 53.4% of function snippets, raising LLM output prediction accuracy by 12.8-24.5% over human-written docs.

AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

AutoACSL integrates CPG-based static analysis into LLM prompts to synthesize ACSL specs for C programs, reporting 98% generation success and 96% full proof ratio with Gemini-3 on 604 programs, with 24.7-51.7% gains over code-only baselines.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

Tensor Algebraic Property Skeletons: Amplifying Property-Based Testing for AI Compilers

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

Propilot instantiates 20 tensor-algebra property skeletons into 4,579 executable PBTs for TVM, cutting redundancy 49% and surfacing semantic and numerical errors.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0 · 4 refs

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

cs.SE · 2026-05-11 · accept · novelty 7.0

CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.

Generating Complex Code Analyzers from Natural Language Questions

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.

A Learning Method for Symbolic Systems Using Large Language Models

cs.SE · 2026-05-09 · unverdicted · novelty 7.0

LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 4 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

cs.CR · 2026-05-03 · unverdicted · novelty 7.0 · 4 refs

VulKey introduces hierarchical expert knowledge abstractions to guide LLMs in vulnerability repair, reporting 31.5% accuracy on PrimeVul (7.6% above best baseline) and strong results on Vul4J.

citing papers explorer

Showing 2 of 2 citing papers after filters.

VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

cs.CR · 2026-05-03 · unverdicted · none · ref 14 · 4 links

VulKey introduces hierarchical expert knowledge abstractions to guide LLMs in vulnerability repair, reporting 31.5% accuracy on PrimeVul (7.6% above best baseline) and strong results on Vul4J.

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

cs.SE · 2026-05-13 · accept · none · ref 11

Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
