The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
Design by contract for deep learning APIs,
13 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 13verdicts
UNVERDICTED 13representative citing papers
Event-B Agent is an LLM agent that synthesizes, refines, and repairs Event-B formal models from natural language requirements via iterative verification feedback loops.
Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.
Gleaner replaces slow graph-based trace analysis with bag-of-edges set operations plus log semantics and alarm-driven diversity to deliver faster, higher-fidelity sampling that improves RCA accuracy even at 1% rates.
Using a corpus of 5542 fault-injected traces from 38 DL programs, the study finds a 0.19 balanced accuracy gap in fault diagnosis between within-program and cross-program evaluation caused by program-specific feature structures.
Quarry improves Rocq proof automation success rates by 7-13% under 10-minute budgets via LLM-planned decompositions ranked by a proof-state difficulty model for CoqHammer solvability.
Proposes data-aware static analysis combining data/control flow and API contracts to detect semantic faults in ML code early, shown on sample real-world notebooks.
dille detects silent semantic faults in random forest ML pipelines with 91% precision via data-informed static analysis on Kaggle notebooks, finding 12-18% of scripts affected.
AutoSOUP automates component-level memory-safety verification by generating Safety-Oriented Unit Proofs via three techniques and a hybrid LLM-plus-program-synthesis architecture called LLM-As-Function-Call.
PROMISE reframes automated proof generation as stateful search over structural embeddings of proof states, outperforming prior LLM-based systems by up to 26 points on the seL4 benchmark.
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
LogCopilot is an LLM framework that builds a hierarchical knowledge base from logs and generates/executes LogQL queries from natural language instructions, reporting 76.8% average accuracy across four datasets.
citing papers explorer
-
Analyzing the Narration Gap in LLM-Solver Loops
The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
-
Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair
Event-B Agent is an LLM agent that synthesizes, refines, and repairs Event-B formal models from natural language requirements via iterative verification feedback loops.
-
Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.
-
Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics
Gleaner replaces slow graph-based trace analysis with bag-of-edges set operations plus log semantics and alarm-driven diversity to deliver faster, higher-fidelity sampling that improves RCA accuracy even at 1% rates.
-
Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs
Using a corpus of 5542 fault-injected traces from 38 DL programs, the study finds a 0.19 balanced accuracy gap in fault diagnosis between within-program and cross-program evaluation caused by program-specific feature structures.
-
Planning to Hammer: Difficulty-Aware Decomposition for Automating Rocq Proofs
Quarry improves Rocq proof automation success rates by 7-13% under 10-minute budgets via LLM-planned decompositions ranked by a proof-state difficulty model for CoqHammer solvability.
-
Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics
Proposes data-aware static analysis combining data/control flow and API contracts to detect semantic faults in ML code early, shown on sample real-world notebooks.
-
Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis
dille detects silent semantic faults in random forest ML pipelines with 91% precision via data-informed static analysis on Kaggle notebooks, finding 12-18% of scripts affected.
-
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
AutoSOUP automates component-level memory-safety verification by generating Safety-Oriented Unit Proofs via three techniques and a hybrid LLM-plus-program-synthesis architecture called LLM-As-Function-Call.
-
PROMISE: Proof Automation as Structural Imitation of Human Reasoning
PROMISE reframes automated proof generation as stateful search over structural embeddings of proof states, outperforming prior LLM-based systems by up to 26 points on the seL4 benchmark.
-
On Reasoning-Centric LLM-based Automated Theorem Proving
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
-
Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
-
LogCopilot: Automating Log Aggregation Analysis through Large Language Models
LogCopilot is an LLM framework that builds a hierarchical knowledge base from logs and generates/executes LogQL queries from natural language instructions, reporting 76.8% average accuracy across four datasets.