2025. IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation
Canonical reference. 80% of citing Pith papers cite this work as background.
representative citing papers
Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.
Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.
First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.
AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.
AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.
Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.
LLMs preserve functional behavior in over 91% of generated VeriFast specifications and source code but achieve only 31.4% verification success, with 94% of failures due to separation logic domain knowledge errors.
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities; the top model, Claude-3.5, scores 48.7/100, and most models struggle on generation, transformation, and correction.
Defines documentation-to-code equivalence and introduces Documentary to generate matching docs for 53.4% of function snippets, raising LLM output prediction accuracy by 12.8-24.5% over human-written docs.
AutoACSL integrates CPG-based static analysis into LLM prompts to synthesize ACSL specs for C programs, reporting 98% generation success and 96% full proof ratio with Gemini-3 on 604 programs, with 24.7-51.7% gains over code-only baselines.
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
Propilot instantiates 20 tensor-algebra property skeletons into 4,579 executable PBTs for TVM, cutting redundancy 49% and surfacing semantic and numerical errors.
An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.
ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.
Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies while finding additional software issues.
LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.
citing papers explorer
- Analyzing the Narration Gap in LLM-Solver Loops
  The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
- AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis
  AutoACSL integrates CPG-based static analysis into LLM prompts to synthesize ACSL specs for C programs, reporting 98% generation success and 96% full proof ratio with Gemini-3 on 604 programs, with 24.7-51.7% gains over code-only baselines.
- Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs
  Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
- Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach
  A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
  SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
- Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
  A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.