archive
Every paper Pith has read. Search by title, abstract, or pith.
1155 papers in cs.SE · page 1
-
Semantically grounded agents detect memory bugs in binaries
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
-
Viverra adds verified assertions to LLM-generated C code
Viverra: Text-to-Code with Guarantees
-
ML classifier beats rules at spotting BDD refactoring chances
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
-
Memory agent keeps repo documentation consistent
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
-
Retriever beats generator in RAG for code tasks
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
-
Stale code snippets make models output outdated helpers
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
-
Disguised compliance rules let attackers hijack LLM agents
Exploiting LLM Agent Supply Chains via Payload-less Skills
-
Multi-agent system automates full library fuzzing lifecycle
FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing
-
Agents resolve 45% of chained package upgrades
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
-
Size filter trims 80% of tokens from LLM repo inputs
Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
-
Valid microservice APIs often fail for AI agents
Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System
-
Web agents should plan before seeing page content
Web Agents Should Adopt the Plan-Then-Execute Paradigm
-
Failure-guided fuzzing beats random testing for HQC programs
Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs
-
Prompt strategy drives test diversity more than model size
LLM-Based Robustness Testing of Microservice Applications: An Empirical Study
-
Constrained edits merge checkpoints to lift code agent scores
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
-
AI agents speed creation of digital music instruments
Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments
-
LLM with SMT solver audits natural-language requirements
Neurosymbolic Auditing of Natural-Language Software Requirements
-
LLMs reach only 52% accuracy on HMSC semantic tasks
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
-
CARS attributes AV collisions to driver faults
Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles
-
SkillOps maintains LLM agent skill libraries as self-maintaining ecosystems
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
-
Quantifier rewrites and non-alias specs speed GPU verification ninefold
Scalable Deductive Verification of Data-Level Parallel Programs
-
Open standards let one agent model run consistently in three simulators
Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles
-
Runtime pruning cuts tokens 49% for local LLM fault localization
SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
-
Call stack data improves RL game testing agents
CA2: Code-Aware Agent for Automated Game Testing
-
Runtime harness mediates AI agent actions on code projects
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
-
LLM-generated code spans a wide readability spectrum
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
-
Noise reshapes mutant detection in quantum programs
Robust Mutation Analysis of Quantum Programs Under Noise
-
Readiness metrics show near-zero link to research software execution success
ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
-
Tool finds 545 reference counting bugs in Linux kernel drivers
Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers
-
Contrastive semantic model improves code translation
Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization
-
Toolkit standardizes benchmarks for screenshot-to-code models
UIBenchKit: A unified toolkit for design-to-code model evaluation
-
Code agents solve far fewer issues in full cycles than isolated tasks
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
-
Code models miss over 93% of fixes from changes alone
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
-
Bonuses for security scans cut issue density in team code
Security Incentivization: An Empirical Study of how Micropayments Impact Code Security
-
LLM JSON stays valid inside tight token budgets
TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints
-
Protocols, not code, decide if generated software is admissible
Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence
-
10.7% of SWE-agent passes are lucky trial-and-error
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
-
Metadata layer turns legacy SAS reports into AI-ready data
A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study
-
Open-source projects follow product life cycles
Project Life Cycles in Open-Source Software
-
cozy checks translations via comparative symbolic execution
Finding a Crab in the C: Assured Translation via Comparative Symbolic Execution
-
Natural language runs grid analyses in under two minutes
Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics
-
Lattice structures LLM judgments for reliable program analysis
Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
-
LLMs match human accuracy in spotting usability requirements in reviews
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
-
Fine-tuned open LLM matches ChatGPT on code feedback quality
Fine-Tuning Models for Automated Code Review Feedback
-
Docker container makes Basilisk GN&C simulations reproducible
Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference
-
Nine iterative LLM audits found 51 prompt defects, converging to zero
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
-
MinTEJ terminal editor for Julia uses less memory than VS Code
Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer
-
LLMs fail most at strategy in GitHub issue fixes
Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
-
Partial programs control risk in LLM code generation
Uncertainty Quantification for LLM-based Code Generation