super hub Canonical reference

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Abhik Roychoudhury, Haifeng Ruan, View Profile, Yuntong Zhang · 2025 · arXiv 5347.2025

Canonical reference. 80% of citing Pith papers cite this work as background.

127 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 127 citing papers more from Abhik Roychoudhury

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 34 dataset 2 method 2 baseline 1 other 1

citation-polarity summary

background 32 support 2 use dataset 2 use method 2 baseline 1 unclear 1

authors

Abhik Roychoudhury and View Profile Haifeng Ruan View Profile Yuntong Zhang

co-cited works

representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

cs.SE · 2026-06-10 · unverdicted · novelty 8.0

Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

cs.CR · 2026-05-08 · conditional · novelty 8.0

Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.

Demystifying the Silence of Correctness Bugs in PyTorch Compiler

cs.SE · 2026-04-09 · conditional · novelty 8.0

First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

cs.SE · 2025-07-20 · conditional · novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.

An Empirical Study of Security Calibration in Large Language Models for Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.

An Empirical Study of LLM-Generated Specifications for VeriFast

cs.SE · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

LLMs preserve functional behavior in over 91% of generated VeriFast specifications and source code but achieve only 31.4% verification success, with 94% of failures due to separation logic domain knowledge errors.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

Natural Language-Focused Software Engineering via Code-Documentation Equivalence

cs.SE · 2026-06-20 · unverdicted · novelty 7.0

Defines documentation-to-code equivalence and introduces Documentary to generate matching docs for 53.4% of function snippets, raising LLM output prediction accuracy by 12.8-24.5% over human-written docs.

AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

AutoACSL integrates CPG-based static analysis into LLM prompts to synthesize ACSL specs for C programs, reporting 98% generation success and 96% full proof ratio with Gemini-3 on 604 programs, with 24.7-51.7% gains over code-only baselines.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

Tensor Algebraic Property Skeletons: Amplifying Property-Based Testing for AI Compilers

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

Propilot instantiates 20 tensor-algebra property skeletons into 4,579 executable PBTs for TVM, cutting redundancy 49% and surfacing semantic and numerical errors.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0 · 4 refs

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

cs.SE · 2026-05-11 · accept · novelty 7.0

CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.

Generating Complex Code Analyzers from Natural Language Questions

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.

A Learning Method for Symbolic Systems Using Large Language Models

cs.SE · 2026-05-09 · unverdicted · novelty 7.0

LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.

citing papers explorer

Showing 6 of 6 citing papers after filters.

Analyzing the Narration Gap in LLM-Solver Loops cs.AI · 2026-06-17 · unverdicted · none · ref 30
The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis cs.AI · 2026-06-18 · unverdicted · none · ref 24
AutoACSL integrates CPG-based static analysis into LLM prompts to synthesize ACSL specs for C programs, reporting 98% generation success and 96% full proof ratio with Gemini-3 on 604 programs, with 24.7-51.7% gains over code-only baselines.
Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs cs.AI · 2026-06-07 · unverdicted · none · ref 40
Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach cs.AI · 2026-05-13 · unverdicted · none · ref 32
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization cs.AI · 2026-04-19 · unverdicted · none · ref 66
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security cs.AI · 2026-05-17 · unverdicted · none · ref 70
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

hub tools

citation-role summary

citation-polarity summary

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer