archive
Every paper Pith has read. Search by title, abstract, or pith.
1155 papers in cs.SE · page 1
-
Semantically grounded agents detect memory bugs in binaries
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
-
Viverra adds verified assertions to LLM-generated C code
Viverra: Text-to-Code with Guarantees
-
ML classifier beats rules at spotting BDD refactoring chances
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
-
Memory agent keeps repo documentation consistent
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
-
Retriever beats generator in RAG for code tasks
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
-
Stale code snippets make models output outdated helpers
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
-
Disguised compliance rules let attackers hijack LLM agents
Exploiting LLM Agent Supply Chains via Payload-less Skills
-
Multi-agent system automates full library fuzzing lifecycle
FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing
-
Agents resolve 45% of chained package upgrades
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
-
Size filter trims 80% of tokens from LLM repo inputs
Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
-
Valid microservice APIs often fail for AI agents
Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System
-
Web agents should plan before seeing page content
Web Agents Should Adopt the Plan-Then-Execute Paradigm
-
Failure-guided fuzzing beats random testing for HQC programs
Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs
-
Prompt strategy drives test diversity more than model size
LLM-Based Robustness Testing of Microservice Applications: An Empirical Study
-
Constrained edits merge checkpoints to lift code agent scores
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
-
AI agents speed creation of digital music instruments
Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments
-
LLM with SMT solver audits natural-language requirements
Neurosymbolic Auditing of Natural-Language Software Requirements
-
LLMs reach only 52% accuracy on HMSC semantic tasks
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
-
CARS attributes AV collisions to driver faults
Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles
-
SkillOps maintains LLM agent skill libraries as self-maintaining ecosystems
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
-
Quantifier rewrites and non-alias specs speed GPU verification ninefold
Scalable Deductive Verification of Data-Level Parallel Programs
-
Open standards let one agent model run consistently in three simulators
Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles
-
Runtime pruning cuts tokens 49% for local LLM fault localization
SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
-
Call stack data improves RL game testing agents
CA2: Code-Aware Agent for Automated Game Testing
-
Runtime harness mediates AI agent actions on code projects
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
-
LLM-generated code spans a wide readability spectrum
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
-
Noise reshapes mutant detection in quantum programs
Robust Mutation Analysis of Quantum Programs Under Noise
-
Readiness metrics show near-zero link to research software execution success
ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
-
Tool finds 545 reference counting bugs in Linux kernel drivers
Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers
-
Contrastive semantic model improves code translation
Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization
-
Toolkit standardizes benchmarks for screenshot-to-code models
UIBenchKit: A unified toolkit for design-to-code model evaluation
-
Code agents solve far fewer issues in full cycles than isolated tasks
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
-
Code models miss over 93% of fixes from changes alone
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
-
Bonuses for security scans cut issue density in team code
Security Incentivization: An Empirical Study of how Micropayments Impact Code Security
-
LLM JSON stays valid inside tight token budgets
TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints
-
Protocols, not code, decide if generated software is admissible
Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence
-
10.7% of SWE-agent passes are lucky trial-and-error
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
-
Metadata layer turns legacy SAS reports into AI-ready data
A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study
-
Open-source projects follow product life cycles
Project Life Cycles in Open-Source Software
-
cozy checks translations via comparative symbolic execution
Finding a Crab in the C: Assured Translation via Comparative Symbolic Execution
-
Natural language runs grid analyses in under two minutes
Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics
-
Lattice structures LLM judgments for reliable program analysis
Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
-
LLMs match human accuracy in spotting usability requirements in reviews
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
-
Fine-tuned open LLM matches ChatGPT on code feedback quality
Fine-Tuning Models for Automated Code Review Feedback
-
Docker container makes Basilisk GN&C simulations reproducible
Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference
-
Nine iterative LLM audits found 51 prompt defects, converging to zero
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
-
MinTEJ terminal editor for Julia uses less memory than VS Code
Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer
-
LLMs fail most at strategy in GitHub issue fixes
Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
-
Partial programs control risk in LLM code generation
Uncertainty Quantification for LLM-based Code Generation