pith. machine review for the scientific record.

archive

Every paper Pith has read.

1155 papers in cs.SE · page 1

  1. cs.SE 2026-05-14 reviewed
    Semantically grounded agents detect memory bugs in binaries

    Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

    Alfredo Pesoli +4

  2. cs.SE 2026-05-14 reviewed
    Viverra adds verified assertions to LLM-generated C code

    Viverra: Text-to-Code with Guarantees

    Haoze Wu +3

  3. cs.SE 2026-05-14 reviewed
    ML classifier beats rules at spotting BDD refactoring chances

    Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

    Ali Hassaan Mughal +2

  4. cs.SE 2026-05-14 reviewed
    Memory agent keeps repo documentation consistent

    Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

    Changkyu Choi +4

  5. cs.SE 2026-05-14 reviewed
    Retriever beats generator in RAG for code tasks

    Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    Haoyu Wang +4

  6. cs.SE 2026-05-14 reviewed
    Stale code snippets make models output outdated helpers

    When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

    Haobin Pan +4

  7. cs.CR 2026-05-14 reviewed
    Disguised compliance rules let attackers hijack LLM agents

    Exploiting LLM Agent Supply Chains via Payload-less Skills

    Xing Hu +3

  8. cs.SE 2026-05-14 reviewed
    Multi-agent system automates full library fuzzing lifecycle

    FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing

    Fengyi Wu +5

  9. cs.SE 2026-05-14 reviewed
    Agents resolve 45 percent of chained package upgrades

    SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

    Chaozheng Wang +7

  10. cs.SE 2026-05-14 reviewed
    Size filter trims 80 percent of tokens from LLM repo inputs

    Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints

    Shweta Mishra

  11. cs.SE 2026-05-14 reviewed
    Valid microservice APIs often fail for AI agents

    Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System

    Davi G. Assunção Pinheiro +2

  12. cs.CR 2026-05-14 reviewed
    Web agents should plan before seeing page content

    Web Agents Should Adopt the Plan-Then-Execute Paradigm

    Annabella Chow +7

  13. cs.SE 2026-05-14 reviewed
    Failure-guided fuzzing beats random testing for HQC programs

    Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs

    Lei Zhang

  14. cs.SE 2026-05-13 reviewed
    Prompt strategy explains more variation in test diversity than model size when using LLMs…

    LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

    Hrushitha Goud Tigulla +1

  15. cs.SE 2026-05-13 reviewed
    Constrained edits merge checkpoints to lift code agent scores

    CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

    Michele Merler +3

  16. cs.SE 2026-05-13 reviewed
    AI agents speed creation of digital music instruments

    Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments

    Matthew John Yee-King

  17. cs.SE 2026-05-13 reviewed
    LLM with SMT solver audits natural-language requirements

    Neurosymbolic Auditing of Natural-Language Software Requirements

    Bethel Hall +1

  18. cs.SE 2026-05-13 reviewed
    LLMs reach only 52% accuracy on HMSC semantic tasks

    (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    Mohammad Reza Mousavi

  19. cs.SE 2026-05-13 reviewed
    LLMs reach only 52% accuracy on HMSC formal semantics

    (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    Mohammad Reza Mousavi

  20. cs.RO 2026-05-13 reviewed
    CARS attributes AV collisions to driver faults

    Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles

    Cheng Wang +7

  21. cs.SE 2026-05-13 reviewed
    SkillOps is a plug-in framework that maintains LLM agent skill libraries by representing…

    SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    Hongji Pu +2

  22. cs.SE 2026-05-13 reviewed
    Quantifier rewrites and non-alias specs speed GPU verification ninefold

    Scalable Deductive Verification of Data-Level Parallel Programs

    Anton Wijs +2

  23. cs.RO 2026-05-13 reviewed
    Open standards let one agent model run consistently in three simulators

    Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles

    Christian Geller +3

  24. cs.SE 2026-05-13 reviewed
    Runtime pruning cuts tokens 49% for local LLM fault localization

    SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization

    Fatemeh Ghassemi +1

  25. cs.SE 2026-05-13 reviewed
    Call stack data improves RL game testing agents

    CA2: Code-Aware Agent for Automated Game Testing

    David Meger +3

  26. cs.SE 2026-05-13 reviewed
    Runtime harness mediates AI agent actions on code projects

    AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

    Hailin Zhong +1

  27. cs.SE 2026-05-13 reviewed
    This paper finds that code generated by large language models has overall readability…

    The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

    Fengyuan Ran +3

  28. cs.SE 2026-05-13 reviewed
    Noise reshapes mutant detection in quantum programs

    Robust Mutation Analysis of Quantum Programs Under Noise

    Eñaut Mendiluze Usandizaga +4

  29. cs.SE 2026-05-13 reviewed
    Readiness metrics show near-zero link to research software execution success

    ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment

    Daniel Mietchen +4

  30. cs.CR 2026-05-13 reviewed
    Tool finds 545 reference counting bugs in Linux kernel drivers

    Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers

    Joe Hattori +2

  31. cs.AI 2026-05-13 reviewed
    Contrastive semantic model improves code translation

    Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

    Chen Shen +5

  32. cs.SE 2026-05-13 reviewed
    Toolkit standardizes benchmarks for screenshot-to-code models

    UIBenchKit: A unified toolkit for design-to-code model evaluation

    Chinh T. Le +4

  33. cs.SE 2026-05-13 reviewed
    Code agents solve far fewer issues in full cycles than isolated tasks

    SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    Hao Guan +10

  34. cs.SE 2026-05-13 reviewed
    Code models miss over 93% of fixes from changes alone

    Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

    Felix Mächtle +4

  35. cs.CR 2026-05-13 reviewed
    Bonuses for security scans cut issue density in team code

    Security Incentivization: An Empirical Study of how Micropayments Impact Code Security

    Alexander Lercher +7

  36. cs.CL 2026-05-13 reviewed
    LLM JSON stays valid inside tight token budgets

    TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

    Shuhei Tarashima +1

  37. cs.SE 2026-05-13 reviewed
    Protocols, not code, decide if generated software is admissible

    Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence

    Deying Yu +1

  38. cs.SE 2026-05-13 reviewed
    10.7% of SWE-agent passes are lucky trial-and-error

    AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    Benjamin Steenhoek +6

  39. cs.SE 2026-05-13 reviewed
    Metadata layer turns legacy SAS reports into AI-ready data

    A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study

    Jaime Yan

  40. cs.SE 2026-05-12 reviewed
    Open-source projects follow product life cycles

    Project Life Cycles in Open-Source Software

    Andrii Ieroshenko +5

  41. cs.SE 2026-05-12 reviewed
    cozy is a comparative binary analysis tool that uses symbolic execution to find…

    Finding a Crab in the C: Assured Translation via Comparative Symbolic Execution

    Caleb Helbling +2

  42. eess.SY 2026-05-12 reviewed
    Natural language runs grid analyses in under two minutes

    Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics

    Boming Liu +2

  43. cs.SE 2026-05-12 reviewed
    Lattice structures LLM judgments for reliable program analysis

    Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

    Chao Wang +1

  44. cs.SE 2026-05-12 reviewed
    LLMs match human accuracy in spotting usability requirements in reviews

    User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models

    Cedric Wellhausen +2

  45. cs.SE 2026-05-12 reviewed
    Fine-tuned open LLM matches ChatGPT on code feedback quality

    Fine-Tuning Models for Automated Code Review Feedback

    Hind Zantout +3

  46. eess.SY 2026-05-12 reviewed
    Docker container makes Basilisk GN&C simulations reproducible

    Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference

    Anubhav Gupta

  47. cs.SE 2026-05-12 reviewed
    Nine LLM audits on prompts found 51 defects and converged to zero

    Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

    Elias Calboreanu

  48. cs.SE 2026-05-12 reviewed
    MinTEJ terminal editor for Julia uses less memory than VS Code

    Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer

    Anurag Sharma +3

  49. cs.SE 2026-05-12 reviewed
    LLMs fail most at strategy in GitHub issue fixes

    Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues

    Guancheng Wang +5

  50. cs.SE 2026-05-12 reviewed
    Partial programs control risk in LLM code generation

    Uncertainty Quantification for LLM-based Code Generation

    Feng Xu +8