pith. machine review for the scientific record.

cs.SE

Software Engineering

Covers design tools, software metrics, testing and debugging, programming environments, etc. Roughly includes material in all of ACM Subject Classes D.2, except that D.2.4 (program verification) should probably have Logics in Computer Science as the primary subject area.

cs.SE 2026-05-13 1 theorem

Nine LLM audits on prompts found 51 defects and converged to zero

Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

Expanded-scope rounds in a 7150-line multi-agent system caught issues single-file checks missed.

abstract
Prompt specifications for multi-agent large language model (LLM) systems carry data contracts and integration logic across many interdependent files but are rarely subjected to structured-inspection rigor. This paper reports a single-system empirical case study of iterative, agent-driven auditing applied to AEGIS (Autonomous Engineering Governance and Intelligence System), a production seven-lane orchestration pipeline whose prompt-specification surface comprises approximately 7150 lines: 6907 across seven lane PROMPT.md files and a 245-line shared Ticket Contract. Nine sequential audit rounds, executed by Claude sub-agents using a checklist-driven walkthrough adapted from Weinberg and Freedman, surfaced 51 prompt-specification consistency defects, distinct from the 51 STRIDE-categorized adversarial code findings reported in the companion preprint. Per-round counts were 15, 8, 12, 2, 8, 1, 4, 1, and 0. We report a seven-category post-hoc defect taxonomy with explicit coding rules, observed non-monotonic convergence consistent with cascading edits and audit-scope expansion, and an audit protocol distilled from the study, with the final locked checklist released as a reproducibility appendix. Single-file review missed defect classes that were surfaced only by later expanded-scope rounds in this system. The same LLM family authored and audited the specifications; replication with dissimilar models and human reviewers is required before generalization.
cs.SE 2026-05-13 Recognition

MinTEJ terminal editor for Julia uses less memory than VS Code

Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer

Shared-buffer modal design unifies editing, execution, and debugging in one lightweight terminal tool.

abstract
Developers rely on lightweight, terminal-centric workflows for rapid code iteration. However, existing tools for the Julia programming language provide limited support for an integrated workflow covering editing, execution, file management, and debugging within a unified environment. As a result, developers frequently incur context-switching overhead and fragmented tool interactions. The proposed work therefore takes a minimalistic approach to developing a native terminal editor for Julia. This paper introduces MinTEJ, a terminal-based editor built in Julia, and proposes a Sequential Modal Interaction Architecture (SMIA) that unifies file management, code editing, execution, and debugging through a command-oriented workflow. The presented work formalizes modal interaction and reduces cognitive load and errors when transitioning among modes. In SMIA, the buffer is the central data structure that persists across all modes; each mode interprets and manipulates the buffer according to mode-specific rules, and a central controller mediates access to the buffer and enforces sequential transitions between modes. To evaluate the approach, MinTEJ is benchmarked against existing tools, i.e., VS Code and Notepad++, on memory consumption and CPU utilization, demonstrating lower resource overhead. The findings suggest that an integrated terminal-based editor environment is a practical, lightweight tool for efficient iterative development.
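The shared-buffer modal design is easy to picture in code. The sketch below is an illustrative Python approximation of a controller that mediates access to one buffer and enforces sequential mode transitions; it is not MinTEJ's Julia implementation, and the mode names and adjacency rule are assumptions.

# Minimal sketch (not MinTEJ's implementation) of a shared-buffer,
# controller-mediated modal design; mode names and order are assumed.
MODES = ["file", "edit", "run", "debug"]          # sequential modal order

class Controller:
    def __init__(self):
        self.buffer = []        # central data structure persisting across modes
        self.mode = "file"

    def transition(self, target):
        # enforce sequential transitions between adjacent modes only
        i, j = MODES.index(self.mode), MODES.index(target)
        if abs(i - j) != 1:
            raise ValueError(f"illegal jump {self.mode} -> {target}")
        self.mode = target

    def apply(self, action, *args):
        # every mode reads/writes the same buffer under mode-specific rules
        if self.mode == "edit" and action == "insert":
            line_no, text = args
            self.buffer.insert(line_no, text)
        elif self.mode == "run" and action == "source":
            return "\n".join(self.buffer)   # hand buffer contents to the runner
        else:
            raise ValueError(f"{action} not allowed in mode {self.mode}")

c = Controller()
c.transition("edit")
c.apply("insert", 0, 'println("hello from the shared buffer")')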
cs.SE 2026-05-13 Recognition

LLMs fail most at strategy in GitHub issue fixes

Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues

Strategy formulation and logic synthesis cause the highest error rates while localization succeeds more often across tested models.

abstract
Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitute the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across different models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.
cs.SE 2026-05-13 Recognition

Partial programs control risk in LLM code generation

Uncertainty Quantification for LLM-based Code Generation

A hypothesis-testing method guarantees the partial-program prediction set contains a correct solution with high confidence while cutting code removal by up to 24.5%.

abstract
Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular large language model (LLM) based code generation, remains a challenging problem. An existing attempt proposes PAC prediction sets but is limited by its strong monotonicity assumption on risk and its single-label classification framework, which severely limits the space of candidate programs and cannot accommodate the multiple valid outputs inherent to code generation. To address these limitations, we propose RisCoSet, an approach that leverages multiple hypothesis testing to construct risk-controlling predictions for LLM-based code generation. Given a trained code generation model, we produce a prediction set represented by a partial program, which is guaranteed to contain a correct solution with high confidence. Extensive experiments on three LLMs demonstrate the effectiveness of the proposed method. For instance, compared with the state of the art, our method can significantly reduce code removal by up to 24.5% at the same level of risk.
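A minimal learn-then-test style calibration sketch, assuming binomial p-values and a Bonferroni correction; it illustrates how multiple hypothesis testing can certify risk-controlling thresholds, not RisCoSet's exact procedure, and all names and data are hypothetical.

# Illustrative learn-then-test style calibration (not RisCoSet's algorithm).
# For each candidate keep-threshold lam we test H0: risk(lam) > alpha with a
# binomial p-value on calibration data, then keep thresholds surviving Bonferroni.
from scipy.stats import binom

def risk_controlling_lambdas(errors_per_lambda, n, alpha=0.1, delta=0.05):
    # errors_per_lambda[lam] = number of calibration tasks where the partial
    # program produced at threshold lam does NOT contain a correct solution
    lambdas = sorted(errors_per_lambda)
    valid = []
    for lam in lambdas:
        k = errors_per_lambda[lam]
        p_value = binom.cdf(k, n, alpha)      # evidence against "risk > alpha"
        if p_value <= delta / len(lambdas):   # Bonferroni over all candidates
            valid.append(lam)
    return valid   # any surviving lam controls risk at level alpha with prob. 1-delta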
cs.SE 2026-05-13 Recognition

Dataset delivers 449 reproducible locator breaks in web GUI tests

ReproBreak: A Dataset of Reproducible Web Locator Breaks

Mining 359 repositories and validating changes in top projects yields examples of breaks from structural UI edits.

abstract
Automated GUI testing frameworks such as Cypress and Playwright rely on locators to find and interact with web elements. A locator break occurs when a structural change in the application under test causes a locator to no longer find its target element, resulting in test breakages even when the underlying functionality remains unchanged. Despite its impact on test maintenance, no dataset exists to evaluate locator fragility in Cypress and Playwright at scale. In this paper, we present ReproBreak, a dataset of reproducible locator breaks in web application GUI tests. We analyzed 359 open-source repositories to identify commits that contain locator changes. To confirm whether these changes are indeed locator breaks, we reproduced them in the top 4 projects with the largest number of locator changes and found 449 locator breaks, which are provided in the dataset along with scripts for automated reproduction. We believe ReproBreak serves as a valuable artifact to support research on locator fragility, repair techniques, and test robustness. The video is available at: https://youtu.be/mZByS_TnCvE. The dataset is at https://github.com/rub-sq/ReproBreak.
cs.SE 2026-05-13 Recognition

Dataset supplies 2440 proprietary industrial repositories

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

CIDR gives researchers 373 million lines of real production code from 12 companies for model training and quality studies.

abstract
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre-training and fine-tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.
cs.SE 2026-05-13 Recognition

Harness design stabilizes small language models at 95 percent success

It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

Pipeline wrappers lift task rates to 0.952 while raw prompts trigger format collapse and lower scores in 2-3B models.

abstract
This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain. VCR (Verification Catch Rate)=0.625 across all pipeline runs.
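A minimal sketch of what a plan->execute->verify->recover harness around a small model could look like; call_model is a hypothetical stand-in, and the stage prompts and JSON contract are assumptions, not the paper's harness.

# Minimal sketch of a 4-stage pipeline harness (plan -> execute -> verify -> recover).
import json

def call_model(prompt: str) -> str:
    # hypothetical stand-in for the underlying 2-3B model
    raise NotImplementedError("wire up your small language model here")

def run_task(task: str, max_recoveries: int = 2) -> dict:
    plan = call_model(f"Plan the steps for this task as a numbered list:\n{task}")
    attempt = call_model(
        f"Follow the plan and answer ONLY with JSON.\nPlan:\n{plan}\nTask:\n{task}"
    )
    for _ in range(max_recoveries + 1):
        try:
            return json.loads(attempt)            # verify: output must be valid JSON
        except json.JSONDecodeError as err:       # recover: feed the violation back
            attempt = call_model(
                f"Your previous output was not valid JSON ({err}). "
                f"Re-emit ONLY the corrected JSON for the task:\n{task}"
            )
    return {"error": "format collapse: no valid JSON within recovery budget"}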
cs.SE 2026-05-13 2 theorems

Framework embeds values in CPS human monitoring rules

HM-Req: A Framework for Embedding Values within CPS Human Monitoring Requirements

Controlled natural language plus dashboard flags value clashes early during requirements work for systems that watch people.

abstract
Monitoring humans, for example, their movement or location, is essential for safe and efficient human-machine collaboration in Cyber-Physical Systems (CPS). This information allows CPS to ensure safety properties, adapt their behaviour dynamically, and coordinate with humans. To ensure that the design of a CPS respects ethical principles and the privacy of its stakeholders, system requirements, particularly those related to human monitoring, must reflect the human values of all involved stakeholders. However, human values are often underrepresented in Software Engineering -- particularly during requirements elicitation and system design, crucial phases when introducing ethically critical functionality. Stakeholder values are often implicit and conflicting, yet rarely systematically captured. Furthermore, unstructured natural language requirements introduce ambiguity and vagueness, complicating conflict resolution. To address these problems, we propose HM-Req, a novel requirements elicitation framework including a Controlled Natural Language (CNL) for defining human monitoring requirements. These requirements are then augmented with human values from relevant stakeholders and integrated into a Value Dashboard to detect potential conflicts that require further discussion and resolution. Validation results, obtained by applying the CNL to different datasets and conducting a survey and an expert interview, confirm the CNL's ability to capture diverse human monitoring requirements and show HM-Req's usefulness for requirements elicitation activities.
cs.SE 2026-05-13 Recognition

Agent decision traces vary up to 43 points in completeness across SDKs

Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

Pilot classifies Decision Event Schema properties and finds one universal gap plus four regime-specific gaps in reconstructing what agents did and why.

abstract
Agentic AI failures need post-hoc reconstruction: what the agent did, on whose authority, against which policy, and from what reasoning. Cross-regime feasibility remains unmeasured under one property-level schema. We apply the Decision Trace Reconstructor unmodified to pinned worked-example anchors from six public vendor SDK regimes spanning cloud-agent, observability, tool-use, telemetry, and protocol traces, plus two comparator columns. Each Decision Event Schema (DES) property is classified as fully fillable, partially fillable, structurally unfillable, or opaque. Per-property reconstructability of an agent decision already varies between regimes at this anchor scale. Strict-governance-completeness separates into three tiers ranging from 42.9% to 85.7%, yielding one regime-independent gap (reasoning trace), four regime-dependent gaps, and one Mixed property; the pilot is single-annotator, one anchor per cell, descriptive, with outputs checksum-verifiable from a deposited reproducibility package.
cs.SE 2026-05-13 2 theorems

Print statements teach code models to reason step by step

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

Explicitly modeling each execution state reduces inconsistent reasoning and boosts overall code performance.

abstract
Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
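An illustrative example of print-based execution-trace anchors; the exact anchor format used by StepCodeReasoner is an assumption here.

# Illustration of print-based execution-trace anchors (anchor format assumed).
def count_evens(xs):
    total = 0
    for i, x in enumerate(xs):
        if x % 2 == 0:
            total += 1
        # anchor: the model is trained to predict this runtime state at each step
        print(f"[STATE] i={i} x={x} total={total}")
    return total

count_evens([3, 4, 7, 8])
# Expected stepwise trace the model must reproduce:
# [STATE] i=0 x=3 total=0
# [STATE] i=1 x=4 total=1
# [STATE] i=2 x=7 total=1
# [STATE] i=3 x=8 total=2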
cs.SE 2026-05-13 1 theorem

Value and popularity drive OSS survival

The Death Spiral of Open Source Projects: A Post-Mortem Analysis of Pull Request Workflow Dynamics

Analysis of 1.3 million pull requests shows friction and negativity grow with age but do not cause failure; innovation and ecosystem value predict survival.

abstract
Open Source Software (OSS) projects are central to modern technology, yet their survival rates remain low. Prior research has examined project mortality through macro-level indicators such as commit activity, developer abandonment, and ecosystem dependencies, but the micro-level dynamics of the Pull Request (PR) workflow have been largely overlooked. This study provides the first large-scale post-mortem analysis of PR workflows across 1,736 inactive GitHub repositories and 1.3 million human-driven PRs. Using a mixed-method quantitative design, we investigate three dimensions of mortality. First, our comparative descriptive analysis shows that workflow friction, extended review cycles, and negativity penalties are endemic properties of the entire GitHub platform across both active and inactive projects. Rejected PRs consistently attract higher discussion and negativity regardless of project health. Second, our evolutionary analysis identifies a universal "death spiral" marked by declining innovation rates, exponential backlog growth, and rising merge latency. The collapse was defined by silence and disengagement. Labeling formalization remained endemic throughout the lifecycle, while toxicity did not intensify. Finally, our explanatory modeling demonstrates that project lifespan is not determined by workflow efficiency but by inherent value and ecosystem dynamics. Popularity and innovation emerge as strong positive predictors of survival, while friction, rejection rates, labeling formalization, and negativity scale with longevity as byproducts rather than causes of failure. Robustness checks across alternative inactivity thresholds confirm these findings. Together, this work reframes OSS mortality as a socio-technical phenomenon in which abandonment and ecosystem value dominate survival outcomes, while PR-level workflow discipline plays a secondary role.
cs.SE 2026-05-13 Recognition

Bug localization replication fails after fixing data leak

An Extensive Replication Study of the ABLoTS Approach for Bug Localization

An incorrect cut-off date let later bug reports enter training, inflating ABLoTS performance on the original Java projects.

abstract
Bug localization is the task of recommending source code locations (typically files) that contain the cause of a bug and hence need to be changed to fix the bug. Along these lines, information retrieval-based bug localization (IRBL) approaches have been adopted, which identify the most bug-prone files from the source code space. In current practice, a series of state-of-the-art IRBL techniques leverage the combination of different components (e.g., similar reports, version history, and code structure) to achieve better performance. ABLoTS is a recently proposed approach with the core component, TraceScore, that utilizes requirements and traceability information between different issue reports (i.e., feature requests and bug reports) to identify buggy source code snippets with promising results. To evaluate the accuracy of these results and obtain additional insights into the practical applicability of ABLoTS, we conducted a replication study of this approach with the original dataset and also on two extended datasets (i.e., additional Java dataset and Python dataset). The original dataset consists of 11 open source Java projects with 8,494 bug reports. The extended Java dataset includes 16 more projects comprising 25,893 bug reports and corresponding source code commits. The extended Python dataset consists of 12 projects with 1,289 bug reports. While we find that the TraceScore component, which is the core of ABLoTS, produces comparable or even better results with the extended datasets, we also find that we cannot reproduce the ABLoTS results, as reported in its original paper, due to an overlooked side effect of incorrectly choosing a cut-off date that led to test data leaking into training data with significant effects on performance.
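The leakage issue comes down to a temporal split: when localizing a bug report, only artifacts created strictly before that report may enter the training or similarity corpus. A minimal sketch of that discipline, with hypothetical field names (not the ABLoTS code):

# Sketch of the cut-off-date discipline whose violation caused the leakage.
from datetime import datetime

def usable_history(target, issue_reports):
    cutoff = target["created_at"]                      # the report being localized
    return [r for r in issue_reports
            if r["created_at"] < cutoff]               # strictly earlier reports only

reports = [
    {"id": 1, "created_at": datetime(2020, 1, 5)},
    {"id": 2, "created_at": datetime(2021, 3, 9)},
    {"id": 3, "created_at": datetime(2022, 7, 1)},
]
print([r["id"] for r in usable_history(reports[1], reports)])   # -> [1]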
cs.SE 2026-05-13 Recognition

SMT-LLM resolves Python deps at 83.6 percent

Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation

Constraint graphs from PyPI data and selective LLM use cut median time from 151 s to 24 s and LLM calls by 11x versus PLLM on HG2.9K.

abstract
Dependency resolution is the task of selecting package versions that can be installed together without conflicts. It accounts for a significant share of build failures in modern software projects. In the Python ecosystem, this task is especially challenging due to Python 2/3 incompatibilities, deprecated packages, and widespread missing metadata. Recent work, such as PLLM, tackles this problem by using large language models (LLMs) to infer Python and package versions from code and iteratively repairing them based on build errors. We present SMT-LLM, a hybrid system that replaces LLM-only version guessing with formal constraint solving. SMT-LLM uses deterministic import extraction and Python version detection via abstract syntax tree (AST) analysis, the vermin tool to infer minimum Python versions, and a five-tier import-to-package resolver that queries PyPI before any LLM call. We construct a constraint graph from PyPI metadata and LLM-imputed dependencies for packages with missing metadata, then solve for consistent version assignments using a Z3 satisfiability modulo theories (SMT) solver. On the HG2.9K benchmark using Gemma2:9B (10 GB VRAM), SMT-LLM resolves 83.6% of snippets compared to PLLM's 54.8%, while reducing median resolution time from 151.5 s to 23.9 s (6.3x faster) and average LLM calls from ~24.9 to 2.26 per snippet (11x reduction).
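A toy illustration of the SMT stage using the Z3 Python API; the real constraint graph is built from PyPI metadata and LLM-imputed dependencies, while the packages, version encoding, and bounds below are made up.

# Toy version-constraint solving with Z3, in the spirit of the SMT stage above.
from z3 import Int, Solver, And, sat

a, b = Int("pkg_a"), Int("pkg_b")        # encode versions as integers, e.g. 23 == 2.3
s = Solver()
s.add(And(a >= 20, a <= 26))             # available releases of pkg_a: 2.0 .. 2.6
s.add(And(b >= 10, b <= 15))             # available releases of pkg_b: 1.0 .. 1.5
s.add(b >= 12)                           # snippet needs a pkg_b feature from 1.2+
s.add(a >= b + 10)                       # toy dependency: pkg_a must outpace pkg_b
if s.check() == sat:
    m = s.model()
    print("install pkg_a =", m[a], "pkg_b =", m[b])
else:
    print("unsatisfiable: no consistent version assignment")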
cs.SE 2026-05-13 Recognition

Seminar sets six research priorities for agents and software engineering

A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar

Experts map short- and long-term directions across governance, architecture, quality, and sustainability to align efforts on agentic AI.

abstract
The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.
cs.SE 2026-05-13 Recognition

Compiler feedback lifts neural decompilation success to 83.9 percent

Decaf: Improving Neural Decompilation with Automatic Feedback and Search

Search guided by compilation errors corrects semantic mistakes in AI outputs from optimized binaries while preserving source similarity.

abstract
Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high-level syntax, identifiers, and custom data types are generally lost as the compiler translates human-readable code to low-level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high-level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real -O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.
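A minimal sketch of the compile-and-feed-back loop, assuming a hypothetical neural_decompile placeholder rather than Decaf's model; the gcc invocation is a standard syntax-and-type check.

# Sketch of compiler-feedback search: propose a decompilation, try to compile it,
# and feed the diagnostics back into the next attempt.
import os
import subprocess
import tempfile

def neural_decompile(binary_path: str, feedback: str = "") -> str:
    # hypothetical placeholder for a neural decompiler
    raise NotImplementedError("call your neural decompiler here")

def decompile_with_feedback(binary_path: str, rounds: int = 5):
    feedback = ""
    for _ in range(rounds):
        c_source = neural_decompile(binary_path, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
            f.write(c_source)
            path = f.name
        proc = subprocess.run(["gcc", "-c", path, "-o", os.devnull],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return c_source                 # compiles: keep this candidate
        feedback = proc.stderr              # otherwise search again with the errors
    return None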
cs.SE 2026-05-13 2 theorems

Mined tokens lift LLM flaky test F1-score to 69.34%

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

Neuro-symbolic injection halves the performance drop on perturbed tests versus pure neural baselines.

abstract
Flaky tests, which exhibit non-deterministic pass/fail behavior for the same version of code, pose significant challenges to reliable regression testing. While large language models (LLMs) show promise for automated flaky test classification, they often fail to comprehend the actual logic behind test flakiness, instead overfitting to superficial textual artifacts (e.g., specific variable names). This semantic fragility leads to poor generalization on real-world imbalanced datasets and vulnerability to perturbations. In this paper, we introduce NeuroFlake, a novel neuro-symbolic framework for classifying flaky tests on highly imbalanced, real-world datasets (FlakeBench). Unlike prior approaches that rely on brittle manual rules and black-box learning, NeuroFlake integrates a Discriminative Token Mining (DTM) module to automate the discovery of high-fidelity, statistically significant source code tokens (e.g., specific concurrency primitives or async waits). By injecting these strong latent signals directly into the LLM's attention mechanism, we bridge the gap between neural intuition and symbolic precision. Our experiments demonstrate that neuro-symbolic fusion significantly improves classification performance, raising the F1-score to 69.34% where the prior state of the art reaches 65.79%. We also rigorously evaluate NeuroFlake's robustness through adversarial stress testing, introducing semantics-preserving augmentations (e.g., dead code injection, variable renaming). While baseline models degrade by 8-18 percentage points (pp) on perturbed tests, NeuroFlake remains stable on unseen augmentations, dropping only 4-7 pp.
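A minimal sketch of discriminative token mining by document-frequency log-odds between flaky and stable test corpora; the paper's DTM module uses its own statistical criterion, so this is only an assumed stand-in with made-up data.

# Sketch: rank identifiers by how much more often they appear in flaky tests.
import math
import re
from collections import Counter

def token_counts(tests):
    c = Counter()
    for src in tests:
        c.update(set(re.findall(r"[A-Za-z_]\w*", src)))   # document frequency
    return c

def discriminative_tokens(flaky, stable, top_k=5):
    f, s = token_counts(flaky), token_counts(stable)
    nf, ns = len(flaky), len(stable)
    scores = {
        t: math.log(((f[t] + 1) / (nf + 2)) / ((s[t] + 1) / (ns + 2)))
        for t in set(f) | set(s)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

flaky = ["Thread.sleep(500); awaitAsync(cb);", "awaitAsync(task); Thread.sleep(100);"]
stable = ["assertEquals(2, add(1, 1));", "assertTrue(list.isEmpty());"]
print(discriminative_tokens(flaky, stable))   # e.g. concurrency-related tokens rank first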
cs.SE 2026-05-12 Recognition

LLMs generate natural language specs to verify code compositionally

Natural Language based Specification and Verification

Preliminary results suggest this route can catch issues in LLM-written programs without rigid formal languages.

abstract
Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such models from producing vulnerable implementations in the first place. Formal verification offers a principled route to this objective, but existing verification pipelines typically require specifications written in rigid formal languages. Prior work has explored using LLMs to synthesize such specifications, with limited success. In this paper, we investigate a different approach: using LLMs both to generate specifications and to verify implementations compositionally when the specifications are expressed in natural language. Our preliminary results suggest that this approach is promising.
cs.SE 2026-05-12 Recognition

SysML model drives hardware verification directly via server link

SHIA: A Direct SysML-Hardware Interface Architecture for Model-Centric Verification

Logic gate test shows correct bidirectional messages and zero output discrepancy, keeping the original model as the active reference.

abstract
Model-Based Systems Engineering (MBSE) is widely treated as the backbone of digital engineering, with languages such as the Systems Modeling Language (SysML) providing the means to capture system structure, behaviour, and verification intent. Yet once verification moves to hardware, the system model is routinely left behind. Domain-specific simulation environments, model transformations, and bespoke tool integrations take over, and the model that began as the authoritative reference drifts out of sync with the implementation it was meant to govern. This paper introduces the SysML Hardware Interface Architecture (SHIA), which keeps an executable SysML model directly inside the verification loop, exchanging messages with physical hardware without intermediate transformation chains, co-simulation platforms, or broker-mediated plugins. SHIA is realised through a SysML side server, written in embedded C++ within IBM Rhapsody, and a hardware side server running on a Raspberry Pi, together establishing a bidirectional link between the digital model and the physical system. A logic gate case study demonstrates the approach end-to-end, from hardware model construction and prototype assembly to test harness design, behavioural statechart control, and staged verification of each component before integration. The integrated system exchanged messages correctly in both directions, and Karnaugh map comparison between the SysML-generated and hardware-generated outputs showed zero discrepancy. The result shows that, when paired with a suitable interface, SysML need not remain a static description that informs downstream tools; it can serve as the executable layer through which hardware behaviour is stimulated, observed, and verified. The work demonstrates a route to model-governed verification and a shorter digital thread between system architecture and the hardware that realises it.
cs.SE 2026-05-12 Recognition

Code editor plugin logs student sessions for education datasets

Using Logs to support Programming Education

Granular interaction data enables analysis of comprehension, errors, and exercise timing in programming classes.

abstract
Software developers use metrics to evaluate code quality and productivity, but these practices are still rare in programming education. This project bridges the gap by collecting real-time learning analytics from individual student and whole-class code development logs. This granular, quantitative data provides educators with qualitative insights into the learning process. It allows them to evaluate student comprehension, identify common challenges, and critically assess whether the allocated time for exercises and algorithms is sufficient for mastery. Unlike traditional Learning Management Systems, we propose a novel approach: a plugin for a widely used code editor that captures granular interactions during programming and documentation. The resulting dataset logs coding behaviors, errors, and progress, enabling evidence-based analysis of learning patterns and educational benchmarking. By structuring this real-time programming trail, we support research on teaching methodologies, learner challenges, and skill acquisition. Quantitative metrics complement qualitative assessment by evaluating code, exercise progress, and timestamp logs. Our goal is to provide an open-access database for educators and researchers, fostering data-driven insights to enhance instruction and personalize learning experiences. This work aligns industrial best practices with pedagogical innovation, advancing measurable, empirical approaches to programming education.
cs.SE 2026-05-12 Recognition

Pipeline builds dataset of 347 real C++ performance patches

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

Testing shows an automated repair agent fixes only 13.5 percent of these real-world patches from mature projects, 39 percent of which span multiple files.

abstract
Recent progress in automated repair of performance bugs demands realistic, executable benchmarks. However, existing C++ performance benchmarks are largely built from competitive programming submissions, and recent real-world benchmarks predominantly target Python and .NET. To fill this gap, we present CppPerf-Mine, a configurable pipeline that mines execution-time-improving patches from open-source C++ repositories on GitHub by combining structural commit filtering, an LLM-based commit classifier, and a containerized build & test stage that produces fully reproducible Docker images for each patch. Using CppPerf-Mine, we build CppPerf-DB, a benchmark comprising 347 manually verified patches from 42 mature C++ repositories, 39% of which are multi-file, enabling the evaluation of repository-level repair tools. In our preliminary study, OpenHands correctly fixes only 13.5% of the patches in CppPerf-DB, confirming that real-world C++ performance repair remains an open challenge. CppPerf-Mine and CppPerf-DB are open-source and publicly available at: https://doi.org/10.5281/zenodo.20097425. In addition, a demonstration video is available at: https://www.youtube.com/watch?v=nixlupIgSdM.
cs.SE 2026-05-12 Recognition

AutoSOUP generates unit proofs for component memory safety via LLM hybrid

AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification

The system encodes scope, loop bounds, and environment models automatically, so formal proofs of memory safety no longer require manual setup.

abstract
Memory-safety errors remain a persistent source of zero-day vulnerabilities in low-level software. The problem is especially acute in embedded systems, where hardware protections are often limited and dynamic analysis is difficult to apply effectively. Memory-safety verification can provide stronger assurance by proving the absence of such errors or exposing violations when they exist. However, current verification workflows remain largely manual and require substantial specialized expertise, limiting their adoption in practice. We present AutoSOUP, a system for automating component-level memory-safety verification through Safety-Oriented Unit Proofs. We formalize these unit proofs as artifacts that encode verification choices (scope, loop bounds, and environment models) for verifying safety properties, and introduce three techniques for deriving them automatically. To overcome the limitations of existing automation approaches, we further introduce LLM-As-Function-Call, a hybrid architecture that combines deterministic program synthesis with LLMs to automate these techniques and produce justifiable unit proofs. We evaluate AutoSOUP by assessing its ability to automate memory-safety verification and expose vulnerabilities in verified components, and we characterize the assumptions and guarantees of the resulting proofs.
cs.SE 2026-05-12 Recognition

AI leaves all problem-solving behaviors intact in code extension tasks

ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code

Lab comparison shows every detailed step appears with or without AI, yet stuck causes and recovery paths differ between the two groups.

abstract
A rapidly growing body of research is examining how LLMs influence developers when they code. To date, this research has tended to focus on productivity and code quality outcomes, rather than the underlying cognitive processes involved in programming. To address this gap, we report on the results of an exploratory laboratory study of ten advanced student developers (five with support from AI and five without) who had to make a non-trivial extension to a sizable software system. Leveraging Polya's four problem-solving phases and 25 inductively-generated codes detailing distinct problem-solving behaviors as the primary lenses, we examined: (1) how AI impacted the problem-solving approach the developers used to solve the programming task, and (2) how AI impacted their progress when they became stuck. For the analysis, we triangulated data across multiple sources (e.g., think-aloud, code changes, web searches, and LLM prompts). Unexpectedly, while developers in the AI group repeatedly turned to the AI tool to offload certain aspects of the process, all detailed problem-solving behaviors appeared in both groups. We also found that nine out of ten participants found themselves stuck in their work, but with key differences in how they became stuck and unstuck. We highlight seven distinct causes for being stuck and highlight how AI in some cases helped and in other cases hindered becoming unstuck.
cs.SE 2026-05-12 Recognition

Autoencoder context compression fails on multi-step coding agents

On Problems of Implicit Context Compression for Software Engineering Agents

In-Context Autoencoder succeeds on single-shot code tasks yet breaks down when agents must plan and edit across multiple steps.

abstract
LLM-based Software Engineering agents face a critical bottleneck: context length limitations cause failures on complex, long-horizon tasks. One promising solution is to encode context as continuous embeddings rather than discrete tokens, enabling denser information storage. We apply the recently proposed In-Context Autoencoder for this purpose. While the method performs well on single-shot common-knowledge and code-understanding tasks, our experiments demonstrate that it fails on multi-step agentic coding tasks. In this paper, we explore this phenomenon and discuss possible factors contributing to this failure.
cs.SE 2026-05-12 Recognition

New benchmark tests agents on cracking binaries from executables

CrackMeBench: Binary Reverse Engineering for Agents

CrackMeBench uses educational tasks in a sandbox to score how well models recover keys and logic from compiled programs.

abstract
Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an identical shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
cs.SE 2026-05-12 3 theorems

VISOR automates robot test oracles using vision-language models

VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

Scores task correctness and quality from videos while reporting its own uncertainty, tested on over 1,000 robot videos.

abstract
Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)-based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Results show that Gemini achieves higher recall while GPT achieves higher precision. However, both models show low correlation between uncertainty and correctness, which prevents using uncertainty as a correctness predictor.
cs.SE 2026-05-12 Recognition

DREAMS tool cuts time for DRM model creation and revision

DREAMS: Modelling Support for Research into Engineering and Artistic Design

Preliminary tests with four users show faster revisions, fewer edge crossings, and quicker evidence retrieval than manual drawing of causal models.

abstract
Design Research Methodology (DRM) supports systematic design research through representations such as Reference Models and Impact Models. However, the practical construction and maintenance of these models often remains manual, requiring repeated redrawing, layout adjustment, and separate handling of assumptions, references, and supporting evidence. This can make DRM modelling time-consuming, visually cluttered, and difficult to revise as models increase in complexity. This paper presents DREAMS, an early-stage prototype modelling environment developed to support the creation and maintenance of DRM Reference Models and Impact Models. The tool enables users to construct typed causal models using DRM-relevant elements, define signed causal relationships, and attach assumptions, experiential inputs, and references directly to causal links. It also provides layout support and search functions to improve readability, modifiability, and retrieval of supporting information. A preliminary comparative evaluation with four DRM users was conducted against manual modelling practice. The results indicate reductions in model creation time, revision time, repositioning effort, edge crossings, and evidence retrieval time when using DREAMS. These findings are interpreted as early evidence of practical potential rather than full validation. The contribution of the paper lies in identifying requirements for DRM-aligned modelling support, presenting the design and implementation of DREAMS, and demonstrating its potential to reduce modelling effort and improve traceability in DRM-based research.
cs.SE 2026-05-12 Recognition

ReXCL tool automates requirements extraction and classification

Read, Extract, Classify: A Tool for Smarter Requirements Engineering

It combines heuristics and adaptive model fine-tuning to schematize semi-structured documents with better efficiency and accuracy.

Figure from the paper full image
abstract click to expand
This paper presents the ReXCL tool, which automates the extraction and classification processes in requirements engineering, enhancing the software development life-cycle. The tool features two main modules: Extraction, which processes raw requirement documents into a predefined schema using heuristics and predictive modeling, and Classification, which assigns class labels to requirements using adaptive fine-tuning of encoder-based models. The final output can be exported to external requirement engineering tools. Performance evaluations indicate that ReXCL significantly improves efficiency and accuracy in managing requirements, marking a novel approach to automating the schematization of semi-structured requirement documents.
cs.SE 2026-05-12 Recognition

Margin-aware geometry reduces distortions in imbalanced vulnerability detection

MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection

MARGIN aligns embedding distributions with Voronoi cells via von Mises-Fisher concentration to stabilize decision boundaries on skewed data.

abstract
Software vulnerability detection is critical for ensuring software security and reliability. Despite recent advances in deep learning, real-world vulnerability datasets suffer from two severe challenges: frequency imbalance and difficulty imbalance. We reinterpret these challenges from an embedding geometry perspective, observing that such imbalances induce geometric distortions in hyperspherical representation space. To address this issue, we propose MARGIN, a metric-based framework that learns discriminative vulnerability representations through adaptive margin metric learning and hyperspherical prototype modeling. MARGIN dynamically adjusts geometric regularization according to the distribution structure estimated by the von Mises-Fisher concentration, aligning the probability mass of embedding distributions with their corresponding Voronoi cells, thereby reducing geometric distortion and yielding more stable decision boundaries. Extensive experiments on public vulnerability datasets show that MARGIN consistently outperforms strong baselines, achieving notable improvements in classification and detection, especially on challenging, imbalanced datasets. Further analysis demonstrates that MARGIN produces more structured embedding geometries, improving robustness, interpretability, and generalization.
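For reference, the standard von Mises-Fisher density on the unit hypersphere, whose concentration parameter kappa is the quantity the abstract mentions; how MARGIN maps kappa to margins and Voronoi cells is not reproduced here.

f_p(\mathbf{x};\boldsymbol{\mu},\kappa) = C_p(\kappa)\,\exp\!\left(\kappa\,\boldsymbol{\mu}^{\top}\mathbf{x}\right), \qquad C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},

where I_{p/2-1} is the modified Bessel function of the first kind; a larger estimated kappa means a tighter class cluster on the hypersphere, which is the distributional structure the abstract says drives the adaptive regularization.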
cs.SE 2026-05-12 Recognition

Config file structure shows no detectable effect on coding agent adherence

Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables

Factorial test of four variables shows only session progress and task type produce reliable differences in compliance.

abstract
Frontier coding agents read configuration files (CLAUDE.md, AGENTS.md, Cursor Rules) at session start and are expected to follow the conventions inside them. Practitioners assume that structural choices (file size, instruction position, file architecture, contradictions in adjacent files) measurably affect adherence. We report a systematic factorial study of these choices using four manipulated variables, measuring compliance with a trivial target annotation across 1,650 Claude Code CLI sessions (16,050 function-level observations) on two TypeScript codebases, three frontier models (primarily Sonnet 4.6, with Opus 4.6 as a CLI-matched cross-model check and Opus 4.7 reported descriptively under a CLI-version confound), and five coding tasks. We use mixed-effects models with a Bayesian companion. None of the four structural variables or three two-way interactions produces a detectable contrast after multiple-testing correction. Size and conflict nulls are supported by affirmative-null Bayes factors (BF10 between 0.05 and 0.10); position and architecture nulls are failures to reject without Bayes-factor support. The largest effect we measured is within-session: each additional function the agent generates is associated with approximately 5.6% lower odds of compliance per step (OR = 0.944) within the session-length range we tested, though the relationship is non-monotonic rather than a constant per-step effect. This reproduces on a second TypeScript codebase and on Opus 4.6 at matched configuration; it was identified during analysis rather than pre-specified. Within the conditions tested, file-structure variables did not produce detectable contrasts; compliance varies systematically between coding tasks and across each session's sequence of generated functions.
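As a rough worked reading of the within-session effect, valid only under the constant-per-step simplification the authors explicitly caution against (they report the relationship is non-monotonic):

\mathrm{OR}(k) = 0.944^{k}, \qquad \mathrm{OR}(10) = 0.944^{10} \approx 0.56,

i.e. roughly 44% lower odds of compliance by the tenth generated function of a session than at its start.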
cs.SE 2026-05-12 1 theorem

Deterministic orchestration matches LLM-controlled accuracy at up to 3.5x lower token cost

Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization

In COBOL-to-Python tests, fixed execution policies deliver stable results and cut token use without losing translation quality.

abstract
Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality.
cs.SE 2026-05-11 Recognition

Cloning duplicates 60-85% of high-similarity tool pairs in agent ecosystems

Evaluating Tool Cloning in Agentic-AI Ecosystems

Audit of 100k tools across 8,861 repositories shows hidden duplication inflates diversity counts and biases benchmarks.

abstract
Agent tools are becoming a core interface through which LLM agents access external data, services, and execution environments. As these tools are distributed through public marketplaces, raw tool counts may substantially overstate ecosystem diversity if many repositories are cloned, lightly modified, or derived from shared templates. Such hidden duplication can contaminate benchmark splits, propagate vulnerable implementations, bias measurements of tool-use generalization, and raise provenance, attribution, and intellectual-property concerns. We present, to our knowledge, the first large-scale measurement study of tool cloning in agentic AI ecosystems. We curate a unified dataset from multiple public platforms, covering 7,508 Model Context Protocol (MCP) repositories with 87,564 extracted tools and 1,353 Skills repositories with 12,447 tools, for a total of 8,861 repositories and 100,011 tool entries. To measure implementation-level duplication, we build a repository-level auditing pipeline using complementary lexical and fuzzy-structural similarity metrics, and compute pairwise similarity across MCP-to-MCP, Skills-to-Skills, and MCP-to-Skills repository pairs. We further manually verify 100 sampled pairs per MCP and Skills ecosystem across similarity-score buckets to calibrate how often high similarity reflects true code cloning. Our analysis shows that cloning is not an isolated artifact: high-similarity regions appear across comparison settings, and 60% of high-Jaccard candidates and 85% of high-ssdeep candidates in the MCP ecosystem are manually verified as clones. These results indicate that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems. They further suggest that agent-tool datasets and benchmarks should account for repository provenance and implementation similarity when measuring tool diversity or constructing evaluation splits.
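A generic stand-in for the two-sided similarity check: lexical Jaccard over identifier sets plus a fuzzy sequence ratio. The paper pairs Jaccard with ssdeep; difflib is used here only to keep the sketch standard-library, and the thresholds are illustrative.

# Sketch of flagging clone candidates via lexical and fuzzy similarity.
import re
from difflib import SequenceMatcher

def tokens(src):
    return set(re.findall(r"[A-Za-z_]\w*", src))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def clone_candidate(a, b, jac_t=0.8, fuzzy_t=0.8):
    return jaccard(a, b) >= jac_t or SequenceMatcher(None, a, b).ratio() >= fuzzy_t

tool_a = "def get_weather(city): return fetch('api/weather', city)"
tool_b = "def get_weather(town): return fetch('api/weather', town)"
print(jaccard(tool_a, tool_b), clone_candidate(tool_a, tool_b))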
cs.SE 2026-05-11 Recognition

Under a shared contract, the agent benchmark's evidence gate changes controller choice

An Executable Benchmarking Suite for Tool-Using Agents

Clean baseline and live-stressed evaluations pick different variants under the same workload in WebArena Verified study.

abstract
Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.
cs.SE 2026-05-11 Recognition

GenAI turns software engineering from code writing to intent oversight

From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability

Thematic analysis of literature and discourse finds lowered code costs but rising demands on specification, verification, and governance.

abstract
Generative artificial intelligence (GenAI) and agentic systems are moving software engineering from code-centric production toward intent-centric human-agent work in which natural language, repository context, tools, tests, and governance shape delivery. Prior studies examine code generation, AI pair programming, and software engineering agents, but less is known about how public technical discourse and peer-reviewed evidence together frame the profession's near-term transition. This study addresses that gap through a reflexive thematic analysis (RTA)-dominant, interpretative phenomenological analysis (IPA)-informed public-discourse and document analysis. The corpus combines peer-reviewed software engineering and AI literature, technical benchmarks, public talks and interviews, essays, product-facing technical announcements, and X-originated discourse from prominent AI and software engineering voices. Sources were organized through a corpus register, codebook, coding matrix, theme-to-source traceability table, DOI/reference audit, and reproducibility protocol. The analysis shows that GenAI lowers the cost of producing plausible code while increasing the importance of intent specification, context curation, architecture knowledge, verification, security, provenance, governance, and accountable human judgment. The findings indicate that software engineering is becoming less about isolated code authorship and more about supervising, validating, and governing socio-technical systems of humans, agents, tools, and evidence gates. This matters because speed-focused adoption can accumulate hidden technical debt and accountability gaps, whereas bounded autonomy can preserve quality, security, maintainability, and trust.
0
0
cs.SE 2026-05-11 Recognition

Trajectory context lifts tool accuracy from 39% to 57%

Trajectory Supervision for Continual Tool-Use Learning in LLMs

Retaining full API histories during sequential domain training improves next-call prediction over stripped final-call training in a pilot on API-Bank.

Figure from the paper full image
abstract click to expand
Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9\% final exact full-call accuracy compared with 39.2\% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1\% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.
0
0
cs.SE 2026-05-11 Recognition
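
The contrast between Condition A and Condition B comes down to how each training example's prompt is assembled. Below is a minimal sketch of that difference, assuming a list of (api_request, api_response) turns per dialogue; the helper name and data layout are illustrative, not taken from the paper's code.

```python
# Sketch: building next-call prediction examples with or without trajectory context.
# The dialogue format and field names are assumptions for illustration.

def build_example(user_goal, history, next_call, keep_trajectory):
    """history is a list of (api_request, api_response) strings from earlier turns."""
    lines = [f"User goal: {user_goal}"]
    if keep_trajectory:
        # Condition B: keep every earlier API request/response line in the prompt.
        for req, resp in history:
            lines.append(f"API request: {req}")
            lines.append(f"API response: {resp}")
    # Condition A simply omits the loop above, so the model sees only the goal.
    lines.append("Next API call:")
    return {"prompt": "\n".join(lines), "target": next_call}

history = [("GetWeather(city='Rome')", "{'temp': 21}")]
print(build_example("book a flight if it is warm", history,
                    "SearchFlights(dest='Rome')", keep_trajectory=True)["prompt"])
```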

Regional zoom beats global Pareto in 84-89% of SE tasks

Zoom, Don't Wander: Why Regional Search Outperforms Pareto Reasoning and Global Optimization in Budget-Constrained SBSE

Greedy focus on compact optimal islands runs 1000x faster than full exploration under limited budgets.

Figure from the paper full image
abstract click to expand
Traditional Search-Based Software Engineering (SBSE) assumes global search and full Pareto exploration are essential. We offer the following negative result based on a study of over 100 Software Engineering (SE) optimization tasks: "zooming" into promising regions is far more effective than Pareto and global exploration under constrained evaluation budgets. Our minimal greedy zoom method, EZR, runs three orders of magnitude faster than Pareto and global Bayesian methods, achieving higher statistical ranks and winning or tying in 84-89\% of datasets on equal budget. Even at one-fifth the evaluation budget, EZR wins or ties in 79-81\% of datasets. Surprisingly, despite never explicitly seeking a frontier, EZR matches or outperforms Pareto methods on their own coverage metrics (IGD, HV) at equal budgets. The explanation for this widespread failure is structural: across the datasets studied, Pareto-optimal solutions form a tiny, tight island concentrated in a compact region of decision space. Methods that wander waste their budgets outside this island. Beyond efficiency, zooming yields small, interpretable models, thus addressing concerns about black-box AI. By replacing global wandering with greedy zooming, we make SBSE much faster, more explicable, and hence accessible to a wider audience. SBSE practitioners and researchers should zoom, not wander.
0
0
cs.SE 2026-05-11 2 theorems
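
A toy illustration of the "zoom, don't wander" idea under a fixed evaluation budget: sample a few configurations, keep the best, and shrink the search box around it instead of exploring globally. This is a generic greedy-zoom sketch on a synthetic objective, not the EZR implementation.

```python
import random

def greedy_zoom(objective, bounds, budget=30, per_round=5, shrink=0.5):
    """Minimize `objective` over the box `bounds` by repeatedly zooming around the best sample."""
    lo, hi = list(bounds[0]), list(bounds[1])
    best_x, best_y = None, float("inf")
    used = 0
    while used < budget:
        batch = [[random.uniform(l, h) for l, h in zip(lo, hi)]
                 for _ in range(min(per_round, budget - used))]
        used += len(batch)
        for x in batch:
            y = objective(x)
            if y < best_y:
                best_x, best_y = x, y
        # Zoom: shrink each dimension's range around the incumbent best point.
        for d in range(len(lo)):
            half = (hi[d] - lo[d]) * shrink / 2
            lo[d], hi[d] = best_x[d] - half, best_x[d] + half
    return best_x, best_y

# Synthetic objective standing in for an SE optimization task.
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] + 0.7) ** 2
print(greedy_zoom(f, ([-2, -2], [2, 2])))
```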

ConCovUp lifts concurrency test coverage from 37% to 68%

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

Multi-agent LLM framework with static analysis and backward tracing generates drivers that hit more shared memory pairs in C/C++ code.

Figure from the paper full image
abstract click to expand
Concurrency testing is essential to improve the reliability and security of multi-threaded programs. Dynamic analysis tools, such as TSan, depend on high-quality test drivers that reach critical shared-memory interactions at runtime. However, current testing practices predominantly focus on sequential logic, leaving a gap in automated concurrent test generation. Recently, large language models (LLMs) have shown promise in generating sequential tests, but they struggle to produce effective concurrent tests without a deep understanding of concurrency semantics. This paper presents ConCovUp, a multi-agent framework that combines LLMs with program analysis. ConCovUp grounds test generation in static analysis to extract shared memory accesses and their calling contexts. To trigger hard-to-reach accesses, it introduces an LLM-driven backward tracing approach, leveraging the model's semantic reasoning to deduce concrete inputs that satisfy complex path constraints, and iteratively refines the generated tests via dynamic execution feedback. Our evaluation on nine real-world C/C++ libraries shows that ConCovUp improves average Shared Memory Access Pair Coverage (SMAP Coverage) from 36.6% to 68.1% over the general Claude Code agent baseline.
0
0
cs.SE 2026-05-11 2 theorems
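
SMAP coverage, as described, is the fraction of statically identified shared-memory access pairs that the generated drivers actually exercise at runtime. A minimal sketch of that ratio follows; the (variable, access-site, access-site) encoding of a pair is an assumption for illustration.

```python
# Sketch: Shared Memory Access Pair (SMAP) coverage as a simple set ratio.

def smap_coverage(static_pairs, observed_pairs):
    static_pairs = set(static_pairs)
    covered = static_pairs & set(observed_pairs)
    return len(covered) / len(static_pairs) if static_pairs else 0.0

static = {("counter", "inc@worker.c:40", "read@report.c:12"),
          ("queue", "push@prod.c:88", "pop@cons.c:31"),
          ("flag", "set@init.c:10", "check@loop.c:77")}
observed = {("counter", "inc@worker.c:40", "read@report.c:12")}
print(f"SMAP coverage: {smap_coverage(static, observed):.0%}")  # 33%
```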

Multi-agent belief revision verifies code authors without training

MACAA: Belief-Revision Multi-Agent Reasoning for Open-World Code Authorship Verification

Coordinator reconciles layout, syntax and pattern evidence from four agents to handle unknown authors and mixed languages.

Figure from the paper full image
abstract click to expand
Code authorship attribution (CAA) supports software forensics, plagiarism detection, and intellectual property protection. However, existing supervised CAA approaches suffer from scarce training data and closed-world assumptions: they require sufficient labeled code from fixed candidate-author sets, making training difficult in low-data cases and predictions unreliable for open-world test pairs with unseen samples, or heterogeneous code pairs. Large language models remove task-specific training, but direct prompting depends on costly expert-designed prompts, can hallucinate over complex heterogeneous code pairs, and rarely yields auditable evidence traces. We propose MACAA, a belief-revision-based multi-agent framework for training-free code authorship verification. MACAA comprises a Coordinator and four Expert Agents analyzing layout, lexical, syntactic, and programming-pattern evidence. The Coordinator gathers expert signals for expansion, discounts unreliable evidence through contraction, and resolves conflicts through revision to preserve belief consistency, replacing direct LLM judgment with auditable hypothesis refinement. MACAA achieves 89.15% F1 on same-language benchmarks and 80.00% on mixed cross-language pairs, outperforming all baselines on most benchmarks and remaining competitive on all.
0
0
cs.SE 2026-05-11 1 theorem
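
One way to read the Coordinator's expansion/contraction/revision cycle is as weighted evidence pooling in which unreliable signals are discounted before conflicts are resolved. The sketch below is a deliberately simplified stand-in, assuming each expert emits a same-author score in [0, 1] with a self-reported reliability; it is not the paper's belief-revision operator.

```python
# Simplified coordinator over four expert agents (layout, lexical, syntactic, pattern).
# Scores, reliabilities, and the 0.5 decision threshold are illustrative assumptions.

def coordinate(signals, min_reliability=0.4):
    # Expansion: start from all expert evidence.
    evidence = list(signals)
    # Contraction: discount (here, drop) evidence whose reliability is too low.
    evidence = [(name, score, rel) for name, score, rel in evidence if rel >= min_reliability]
    if not evidence:
        return "undecided", 0.5
    # Revision: resolve remaining conflicts by reliability-weighted pooling.
    total = sum(rel for _, _, rel in evidence)
    pooled = sum(score * rel for _, score, rel in evidence) / total
    return ("same-author" if pooled >= 0.5 else "different-author"), pooled

signals = [("layout", 0.8, 0.9), ("lexical", 0.4, 0.3),
           ("syntactic", 0.7, 0.8), ("pattern", 0.3, 0.6)]
print(coordinate(signals))
```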

Ethical safeguards prioritized in cost model for LLM education use

Prediction Model of Motivators and Demotivators of Integrating Large Language Models in Software Engineering Education: An Empirical Study

Analysis of 126 stakeholders finds governance mechanisms offer best trade-off under budget limits for software engineering classes

Figure from the paper full image
abstract click to expand
Context: Large Language Models (LLMs) are increasingly influencing software engineering practice and education. While prior studies examine their technical performance and classroom use, limited research provides cost-aware and empirically grounded models for systematic institutional integration. Objective: This study develops and validates a prediction model to identify cost-efficient strategies for integrating LLMs into software engineering education using motivating and demotivating factors. Method: Based on our previously developed literature survey taxonomies [1], we operationalized 19 validated factors (9 motivators and 10 demotivators) into a structured survey completed by 126 stakeholders from multiple countries. Likert-scale responses were encoded and used to train probabilistic models (Naive Bayes and Logistic Regression) to estimate the likelihood of high LLM familiarity. The probability estimates were integrated into a Genetic Algorithm (GA)-based optimization framework to model trade-offs between predicted familiarity and implementation cost at global and category levels. Results: Respondents perceived strong benefits in Programming Assistance and Debugging Support and Personalized and Adaptive Learning. Major concerns included Plagiarism and Intellectual Property Concerns, Over-Reliance on AI in Learning, and Reduced Critical Thinking and Problem Solving. Optimization results indicate that governance-related mechanisms, particularly integrity and ethical safeguards, should be prioritized under cost constraints. Conclusions: The study introduces an optimization-informed decision support framework linking stakeholder perceptions with probabilistic modeling and cost-effort analysis. The model supports staged and cost-aware LLM integration grounded in governance stability and pedagogically meaningful development.
0
0
cs.SE 2026-05-11 Recognition

Execution traces create first noise-free test for LLM code understanding

An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning

TraceEval mechanically confirms every call edge in 10k real programs, letting researchers measure how well models recover runtime structure.

Figure from the paper full image
abstract click to expand
Evaluating whether large language models (LLMs) can recover execution-relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test-passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE-Bench); this is useful, but offers limited diagnostic signal about which program semantics a model can recover from source. We introduce TraceEval, to our knowledge the first execution-verified, multi-language benchmark for code semantic reasoning: recovering a program's runtime call structure from source code. Unlike prior call-graph benchmarks that rely on static-tool output or hand-annotated ground truth, every positive edge in TraceEval is mechanically witnessed by validation execution, eliminating annotator disagreement and label noise for observed behavior. TraceEval consists of (i) 10,583 real-world programs (2,129 test, 8,454 train) extracted from 1,600+ open-source repositories across Python, JavaScript, and Java via an LLM-assisted harness-generation pipeline with tracer validation; and (ii) a reproducible pipeline that converts any open-source repository into new verified benchmark instances. We evaluate 10 LLMs at zero-shot on the held-out test split. The strongest model, Claude-Opus-4.6, reaches an average F1 of 72.9% across the three languages. To demonstrate the train split's utility as a supervision substrate, we fine-tune the Qwen2.5-Coder family on it: lifts of up to +55.6 F1 bring tuned Qwen2.5-Coder-32B to 71.2%, within 1.7 F1 of zero-shot Claude-Opus-4.6. We release the benchmark, pipeline, baselines, and a datasheet at https://github.com/yikun-li/TraceEva
0
0
cs.SE 2026-05-11 2 theorems
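
Since every positive edge in the benchmark is an execution-witnessed call, scoring a model reduces to comparing its predicted call edges against the traced ones. A minimal F1 sketch over (caller, callee) edge sets; the edge encoding is an assumption for illustration.

```python
# Sketch: precision/recall/F1 of predicted call-graph edges vs execution-verified edges.

def edge_f1(predicted, verified):
    predicted, verified = set(predicted), set(verified)
    tp = len(predicted & verified)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(verified) if verified else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

verified = {("main", "load_config"), ("main", "run"), ("run", "step")}
predicted = {("main", "load_config"), ("run", "step"), ("run", "log")}
print(edge_f1(predicted, verified))  # roughly (0.67, 0.67, 0.67)
```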

Merlin turns natural language into CodeQL queries that raise accuracy 3.8x

Generating Complex Code Analyzers from Natural Language Questions

The system refines queries iteratively with retrieval and self-tests, letting programmers finish analysis tasks 31 percent faster on large codebases.

Figure from the paper full image
abstract click to expand
Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answering questions about codebases that span millions of lines of code across thousands of files is non-trivial. Standard tools like grep cannot answer questions requiring semantic or inter-procedural reasoning, and large language models (LLMs) struggle with large codebases due to resource and context constraints. In this paper, we present Merlin, a new system for answering free-form questions that require analytical reasoning about code. Merlin integrates an LLM with CodeQL, a program analysis framework that supports expressive queries over large codebases. We face two principal challenges in the design of such systems: First, program analysis queries are diverse and semantically complex; as a result, even syntactically well-formed queries frequently produce degenerate/empty results. Furthermore, relatively few CodeQL queries are available online, limiting the out-of-the-box effectiveness of LLMs as CodeQL query generators. We address these challenges by developing a RAG-based iterative query-generation approach and a novel self-test technique. Our query debugging technique builds on the idea of assistive queries, which generate concrete witnesses that expose and explain semantic flaws in candidate queries. We evaluate Merlin through both experimental and user studies. Over a set of natural language questions derived from common bug-finding tasks, Merlin discovered not only the majority of software issues reported by other approaches, but also issues that would have otherwise remained undetected. Through a within-subject user study, we found that access to Merlin increased task accuracy by an average of 3.8x and simultaneously reduced the time for programmers to complete all tasks by 31%.
0
0
cs.SE 2026-05-11 Recognition
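
The abstract describes a loop in which candidate CodeQL queries are generated with retrieved examples, executed, and refined when they come back empty or fail a self-test. The skeleton below shows that control flow only; retrieve_examples, llm_generate, and run_codeql are hypothetical placeholders, not Merlin's actual API.

```python
# Control-flow sketch of RAG-based iterative query generation with a self-test signal.
# All three helpers are hypothetical stubs standing in for real components.

def retrieve_examples(question):      # would query a store of known-good CodeQL snippets
    return ["/* example query */"]

def llm_generate(question, examples, feedback=None):  # would call an LLM
    return "from Call c select c"     # placeholder candidate query

def run_codeql(query):                # would compile and run the query on the codebase
    return {"rows": [], "error": None}

def answer(question, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        query = llm_generate(question, retrieve_examples(question), feedback)
        result = run_codeql(query)
        if result["error"]:
            feedback = f"query failed: {result['error']}"
        elif not result["rows"]:
            # Self-test signal: a degenerate/empty result triggers another refinement round.
            feedback = "query compiled but returned no rows; relax the constraints"
        else:
            return result["rows"]
    return []

print(answer("where is user input passed to a SQL sink?"))
```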

Developer reviews expose LLM code flaws missed by benchmarks

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Three-fold method shows developer input reveals production readiness gaps in code from GPT-4.1, DeepSeek-V3 and Claude Opus 4.

abstract click to expand
Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model. However, they primarily focus on measuring solution correctness, leaving other aspects, such as code quality and usability, behind. This paper aims to describe a custom three-fold evaluation methodology for code generated by Large Language Models that bridges this gap. The methodology includes a dedicated correctness benchmark based on a complex multi-level computer science project, code quality verification, and a survey of developers' opinions on generated code samples gathered through a structured code-review process. The proposed methodology's usage and usefulness are demonstrated by evaluating and comparing three general-purpose Large Language Models: GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4. The results show that reviews gathered from developers can yield many new findings, especially those related to the code being in a production-ready state, that would not be possible to obtain using the standard correctness-focused benchmark approach.
0
0
cs.SE 2026-05-11 Recognition

Fuzzer finds 64 inconsistencies in Solidity compilers

ParityFuzz: Finding Inconsistencies across Solidity Compilers via Fine-Grained Mutation and Differential Analysis

ParityFuzz mutates contracts and compares normalized outputs across environments to expose differences that affect migration and security.

Figure from the paper full image
abstract click to expand
The Solidity smart contract ecosystem has rapidly grown, leading to multiple compilers targeting different blockchain platforms or improving compilation efficiency. Although many compilers aim to be compatible with the primary Solidity compiler (Solc), significant inconsistencies in compilation and execution remain. These inconsistencies hinder contract migration, mislead developers during debugging, and may introduce exploitable vulnerabilities, causing financial losses. Existing testing techniques mainly focus on bugs within a single compiler or perform differential testing in the same execution environment. However, they are insufficient for detecting cross-compiler inconsistencies, as they lack mechanisms to explore triggering conditions and compare bytecode across environments. We propose ParityFuzz, a cross-compiler differential testing framework for Solidity. It operates in three stages. First, it derives mutation rules, including syntax- and boundary-oriented rules, by analyzing compilers and execution environments. Second, it uses reinforcement learning to select effective mutation rules for test generation. Third, it compiles and executes programs across multiple compilers, then normalizes and compares results to detect inconsistencies. Our evaluation shows ParityFuzz is efficient and effective. It achieves up to 18x higher compilation success rate and 1.8x higher code coverage than state-of-the-art fuzzers. It uncovers 64 previously unknown inconsistencies across six compilers. Notably, 11 issues have been fixed, and our findings received a bounty from the Polkadot community.
0
0
cs.SE 2026-05-11 1 theorem
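
The differential stage boils down to compiling or executing the same mutated contract under several compilers, normalizing the results, and flagging any disagreement. A schematic sketch follows, with stub compiler functions standing in for the real Solidity toolchains.

```python
# Sketch: cross-compiler differential comparison on normalized outputs.
# The two stub "compilers" are placeholders for real Solidity toolchains.

def normalize(result):
    # Strip incidental differences (whitespace, case) before comparing behavior.
    return result.strip().lower()

def differential_check(program, compilers):
    outputs = {name: normalize(run(program)) for name, run in compilers.items()}
    if len(set(outputs.values())) > 1:
        return {"program": program, "inconsistent": True, "outputs": outputs}
    return {"program": program, "inconsistent": False}

compilers = {
    "solc_like": lambda p: "revert" if "overflow" in p else "ok",
    "alt_backend": lambda p: "ok",   # ignores the overflow case: an inconsistency
}
print(differential_check("uint8 overflow test", compilers))
```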

Semantic distance beats disagreement counts for LLM code uncertainty

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

Measuring behavioral severity in sampled programs gives stronger correctness signals and halves runtime across models and languages.

Figure from the paper full image
abstract click to expand
LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing estimators make different design choices about how behaviours are identified, aggregated, referenced and compared, making them difficult to assess. We therefore first introduce a taxonomy that disentangles these choices and reveals a missing design point: semantic distance-aware uncertainty estimation, which measures not only whether sampled programs disagree, but how severely their execution behaviours differ. Across LiveCodeBench, MBPP, HumanEval-X and BigCodeBench, spanning Python, Java and C++, our metrics provide strong proxies for correctness, and consistently outperform state-of-the-art sample-based baselines across both closed-source models (GPT-3.5-Turbo, GPT-4o-mini, Gemini-2.5-Flash-Lite, Claude Opus 4.5) and an open-source model (DeepSeek-Coder-V2). The method is practical: it requires neither model internals nor LLM-as-judge calls, remains robust across models, languages, sampling temperatures and fuzzing settings, and reduces runtime by approximately 48-79% relative to existing baselines.
0
0
cs.SE 2026-05-11 Recognition
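
The key move is scoring not just whether sampled programs disagree but how much of their observed behavior differs. A toy sketch of a semantic-distance-style uncertainty score: candidates are Python callables, distance is the fraction of probe inputs on which two candidates produce different outputs, and uncertainty is the mean pairwise distance. The probe inputs and distance definition are illustrative assumptions, not the paper's metrics.

```python
from itertools import combinations

def behavior(program, inputs):
    out = []
    for x in inputs:
        try:
            out.append(program(x))
        except Exception:
            out.append("<error>")   # crashes count as a distinct behavior
    return out

def semantic_uncertainty(candidates, inputs):
    """Mean pairwise behavioral distance across sampled candidate programs."""
    traces = [behavior(c, inputs) for c in candidates]
    pairs = list(combinations(range(len(traces)), 2))
    if not pairs:
        return 0.0
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)
    return sum(dist(traces[i], traces[j]) for i, j in pairs) / len(pairs)

# Three sampled "programs" for abs(x); the third is wrong on negatives.
candidates = [abs, lambda x: x if x >= 0 else -x, lambda x: x]
print(semantic_uncertainty(candidates, [-3, -1, 0, 2, 5]))
```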

Skill drift is contract violation in LLM agent libraries

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

Extracting role-bearing assumptions from skill documents cuts false alarms to zero and raises repair success from 10% to 78%.

Figure from the paper full image
abstract click to expand
LLM agents increasingly rely on reusable skill libraries, but these skills silently decay as the external services, packages, APIs, and configurations they reference evolve. Existing monitors detect such changes at the wrong granularity: they observe values, not the role those values play in a skill. A version string in a comment is noise; the same string in a pinned dependency is an operational obligation. We formulate skill drift as contract violation and introduce a framework that extracts executable environment contracts from skill documents and validates only those role-bearing assumptions against known or live conditions. This distinction turns noisy monitoring into a precision-first maintenance signal. Contract-free CI probes produce 40% false positives, while the proposed framework raises zero false alarms over 599 no-drift and hard-negative cases (Wilson 95% CI [0, 0.6]%). In known-drift verification, it achieves 100% precision and 76% recall with the strongest backbone; in a pre-registered study over 49 real skills, it discovers live drift with 86% conservative precision. Violated contracts also make repair actionable, improving one-round success from 10% without localization to 78%. We release an accompanying 880-pair benchmark for skill degradation.
0
0
cs.SE 2026-05-11 2 theorems
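
The contract framing can be pictured as a small set of role-tagged predicates checked against the current environment, where only role-bearing assumptions (a pinned dependency, a required endpoint) are validated and incidental mentions are ignored. A minimal sketch under those assumptions; the contract fields and environment layout are illustrative, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Contract:
    skill: str
    role: str                      # e.g. "pinned-dependency", "api-endpoint"
    description: str
    predicate: Callable[[dict], bool]

def check(contracts, environment):
    """Return violated contracts; non-role-bearing text (e.g. comments) is never checked."""
    return [c for c in contracts if not c.predicate(environment)]

environment = {"requests_version": "2.32.0",
               "endpoints": {"search": "https://api.example/v2"}}
contracts = [
    Contract("web_search", "pinned-dependency", "requests must stay on 2.31.x",
             lambda env: env["requests_version"].startswith("2.31.")),
    Contract("web_search", "api-endpoint", "v1 search endpoint must exist",
             lambda env: "/v1" in env["endpoints"].get("search", "")),
]
for violation in check(contracts, environment):
    print("drift:", violation.skill, "-", violation.description)
```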

Three-layer gate turns agent failures into bounded fixes

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

PROBE raises diagnosis accuracy to 65% and recovery to 22% on unresolved cases without changing agent policy or tools.

Figure from the paper full image
abstract click to expand
Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
0
0
cs.SE 2026-05-11 2 theorems

LLMs mine tactics that let CoqHammer prove 24% more theorems

A Learning Method for Symbolic Systems Using Large Language Models

The method converts LLM reasoning into reusable symbolic code, raising automated success rates on real verification projects without runtime LLM cost.

Figure from the paper full image
abstract click to expand
Automated theorem proving is essential for the formal verification of safety-critical systems. As the corpus of formal proofs grows, a natural paradigm is to learn from existing proofs. However, current learning-based approaches predominantly train Large Language Models (LLMs) as end-to-end provers, which yields resource-intensive, opaque systems. Conversely, while traditional symbolic provers are computationally efficient, how to automatically improve these solvers from data remains an open challenge. This paper bridges this gap by proposing LLM2Ltac, the first approach that leverages the reasoning power of LLMs not as end-to-end provers, but as intelligent synthesizers to mine purely symbolic tactics from data. Given a corpus of formal proofs, LLM2Ltac asks an LLM to identify latent proof strategies and formalize them into reusable tactics. These tactics are verified for validity and generalizability, and finally integrated into symbolic provers to enhance their automated proving capabilities without the runtime cost of LLMs. We implement LLM2Ltac on Rocq 8.20.0 and mine tactics from 11,725 theorems in the standard library. We evaluate our approach on 6,199 theorems from four large real-world verification projects, namely, compcert, Coq-Art, Ext-Lib, and VFA. Results show that the mined tactics improve CoqHammer to prove 23.87% more theorems, and when integrating the improved CoqHammer with Claude Code, the overall proved theorems increases by 9.90%, indicating the effectiveness of LLM2Ltac.
0
0
cs.SE 2026-05-11 2 theorems

Execution fingerprints beat text voting for LLM code

Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

When no oracle exists, running candidates on generated inputs selects better programs than majority vote on the source text.

abstract click to expand
LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix textual voting, ranking, and execution-based agreement, but the relative contribution of each component remains unclear. We study 18 configurations across different models, thinking levels, and benchmarks, comparing output-pattern majority voting, weighted voting, MBR-Exec, and SemanticVote - a method that clusters candidates by execution fingerprints on LLM-generated inputs. Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration, with every execution-based selector exceeding it by at least 18 points. (2) Once candidates are executed on diverse inputs, aggregation rule has limited effect: SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations. The largest factor is input quality: sketch-based input generation consistently outperforms direct LLM generation by 0.6-2.1 pp and random fuzzing by up to 11.3 pp. (3) Thinking level interacts differently with selection families: deeper thinking improves majority voting by 12 pp but execution-based methods stay flat or degrade as candidate diversity falls. These results frame inference-time code selection as a signal-quality problem rather than an aggregation-rule problem: when oracles are unavailable, the behavioral evidence matters more than the aggregation rule.
0
0
cs.SE 2026-05-11 1 theorem
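
The execution-fingerprint idea is straightforward to sketch: run every candidate on a shared set of generated inputs, use the tuple of outputs as a behavioral fingerprint, cluster candidates by identical fingerprints, and return a member of the largest cluster. The candidates-as-callables setup and plurality rule here are illustrative simplifications, not the paper's pipeline.

```python
from collections import defaultdict

def fingerprint(program, inputs):
    outs = []
    for x in inputs:
        try:
            outs.append(repr(program(x)))
        except Exception as e:
            outs.append(f"<{type(e).__name__}>")
    return tuple(outs)

def semantic_vote(candidates, inputs):
    """Group candidates by execution fingerprint and return one from the largest group."""
    clusters = defaultdict(list)
    for cand in candidates:
        clusters[fingerprint(cand, inputs)].append(cand)
    biggest = max(clusters.values(), key=len)
    return biggest[0], len(biggest)

# Three samples for "square the input"; one is subtly wrong.
candidates = [lambda x: x * x, lambda x: x ** 2, lambda x: x * 2]
winner, votes = semantic_vote(candidates, [0, 1, 3, 7])
print(votes, winner(5))   # 2 votes for the correct behavior; winner(5) == 25
```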

EvidenT repairs 54% of RISC-V package build failures

EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

By retaining all repair history and using external build feedback, it doubles the success rate of agentic baselines.

Figure from the paper full image
abstract click to expand
Frequent toolchain updates and growing ISA diversity have made system-level software package repair increasingly important. Diagnosing and repairing build failures remains challenging because failures involve heterogeneous evidence, dependency constraints, and architecture-specific build conventions. While recent LLM-based repair methods show promise for project-level source fixes, they struggle with system-level repair, where failures span multi-language artifacts such as build recipes, scripts, and source archives, and require iterative validation through external build services. In this paper, we first conduct a systematic empirical study of real-world system-level build failures. We find that 72% of failures stem from dependency and environment misconfigurations rather than isolated code defects, suggesting that effective repair must prioritize packaging logic and iterative feedback. Motivated by these insights, we propose EvidenT, an evidence-preserving repair framework that decouples iteration-aware evidence management from tool execution. EvidenT includes: (1) an external Build Service for reproducible execution and feedback; (2) an Evidence-Preserving Repair Controller that fuses repair history, knowledge context, and build artifacts; and (3) an automated Repair Orchestrator that invokes modular tools for failure localization and system-level repair in a closed-loop validation environment. We evaluate EvidenT on 219 real-world RISC-V package build failures. EvidenT repairs 118 packages (53.88%), outperforming state-of-the-art agentic baselines (20.55%) and direct LLM-based repair (1.83%). To assess architectural generality, we extend EvidenT to legacy ISAs by updating only ISA-specific knowledge context. Preliminary experiments achieve success rates of 41.77% on aarch64 and 46.99% on x86_64, demonstrating robustness across diverse hardware ecosystems.
0
0
cs.SE 2026-05-11 Recognition

Models reach 92 percent on code but only 5 percent on provable code

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

Benchmark of 946 problems shows specification and proof writing as the central barriers to verifiable AI-generated software.

Figure from the paper full image
abstract click to expand
Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code, but also formal specifications and machine-checkable proofs. Progress in this direction, however, is difficult to measure: existing benchmarks are often small, focus on only one part of the pipeline, lack ground-truth proofs or rigorous specification validation, or target verification settings far from mainstream software development. We present VeriContest, a benchmark of 946 competitive-programming problems from LeetCode and Codeforces for verifiable code generation in Rust with Verus. Each problem pairs a natural language description with expert-validated formal specifications, judge-accepted Rust code, Verus-checked proofs, and positive and negative test suites. VeriContest is constructed through a three-phase pipeline that scales from manually verified seed problems to semi-automated expansion with human-in-the-loop review. To further strengthen benchmark quality, we use testing as an additional quality-assurance layer for validating postcondition completeness. VeriContest supports isolated and compositional evaluation of specification generation, code generation, proof generation, and end-to-end verified program synthesis. Evaluating ten state-of-the-art models reveals a sharp gap between coding ability and verifiable code generation: the strongest model reaches 92.18% on natural-language-to-code generation, but only 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end. These results identify proof and specification generation as the central bottlenecks for models and establish VeriContest as a rigorous platform for measuring and training future systems that generate code with machine-checkable correctness.
0
0
cs.SE 2026-05-11 2 theorems

Dataset collects 15k configs for AI coding tools

A Dataset of Agentic AI Coding Tool Configurations

From 4,738 open repositories, it tracks how developers guide tools like Claude Code and Cursor with context files and rules.

Figure from the paper full image
abstract click to expand
Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration.
0
0
cs.SE 2026-05-11 Recognition

AI agents omit runtime details in their own technical talks

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

On an AI-only network their discussions stress security, trust and workflows but rarely include the code artifacts or error reproductions that ground human developer discourse.

Figure from the paper full image
abstract click to expand
AI agents are increasingly framed as software-engineering teammates, yet most research studies them inside human-centered workflows. Little is known about the software-engineering discourse autonomous AI agents produce when they interact primarily with one another. This paper examines what autonomous AI agents discuss in MoltBook, an AI-agents-only social network, how that discourse is organized, and how it differs from human developer discourse. We combine human open coding of a 500-post sample, a concentration-plus-check topic-analysis pipeline over 4,707 English-filtered MoltBook technology posts, and a matched-instrument comparison against 5,211 GitHub Discussions posts. MoltBook technology discourse spans 12 recurring themes and is led by Security and Trust (27.4%). At the community level, activity is highly concentrated: the largest submolt contains 63.5% of posts and the Gini coefficient is 0.88, yet a stability-aware BERTopic pipeline still yields 32 non-outlier sub-topics. Compared with the GitHub Discussions baseline, MoltBook discourse contains fewer concrete, context-rich cues such as code-formatted artifacts, environment details, runtime failures, and reproduction steps; social mimicry appears only in a limited way, while idealization is mainly reflected through lower hedging. Overall, AI-only technical discourse is coherent but selective. It repeatedly returns to concerns such as security and trust, memory and context management, tooling and APIs, debugging and error handling, workflow automation, and infrastructure/ops, while omitting much of the concrete runtime and project-local detail common in human developer discourse. This may be because MoltBook contains fewer environment-specific failures, reproduction steps, and other concrete grounding cues.
0
0
cs.SE 2026-05-11 Recognition

AI agents start most PRs but humans keep merge authority

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Study of 29,585 cases shows collaborator tools let agents drive work while approval stays human across all tools.

Figure from the paper full image
abstract click to expand
When AI coding agents open branches and submit pull requests (PRs), two questions co-determine oversight design: who starts the work (operational agency) and who authorizes its completion (merge governance). We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). We analyze 29,585 PR lifecycles using an Initiator x Approver taxonomy with six interaction scenarios; lifecycle reconstruction supplies the how behind those roles. Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs. Where automation executes a merge, logs record the executor but not the decision-maker, marking a boundary of observation. We contribute the taxonomy, per-tool state machines, and a replication package for research on automation, oversight, and governance in PR workflows.
0
0
cs.SE 2026-05-11 Recognition

Similar past faults annotated to guide LLMs in test code

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

SPARK pulls matching CI cases to mark likely buggy lines, raising accuracy on multi-fault tests without added cost.

Figure from the paper full image
abstract click to expand
Software failures remain a major challenge in modern software development, and identifying the code elements responsible for failures is a time-consuming debugging task. While extensive research has focused on fault localization in the system under test (SUT), failures can also originate from faulty system test scripts. This problem, known as Test Code Fault Localization (TCFL), has received significantly less attention despite its importance in continuous integration (CI) environments where large test suites are executed frequently. TCFL is particularly challenging because it typically operates under black-box conditions, relies on limited diagnostic signals such as error messages and partial logs, and involves large system-level test scripts that expand the fault localization search space. In this paper, we propose SPARK, a framework that integrates accumulated debugging knowledge from continuous integration (CI) environments into Large Language Model (LLM)-based TCFL. Given a newly observed failing test case, SPARK retrieves similar fault-labeled test cases from a debugging knowledge corpus and selectively annotates suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding the prompt-length explosion common to naive retrieval-augmented approaches. We evaluate SPARK on three industrial datasets containing real-world faulty Python test cases from different software products. The results show that SPARK consistently improves fault localization effectiveness compared to the existing LLM-based TCFL baseline while maintaining comparable inference cost and token usage. In particular, the approach advances the state of the art by identifying more correct faulty locations in complex test cases containing multiple faults.
0
0
cs.SE 2026-05-11 2 theorems

Trace comparison creates a score for design conformance

Evaluating Design Conformance Through Trace Comparison

Distributed system traces from execution can be aligned with design traces to give a percentage that tracks how well code stays true to its intended design.

Figure from the paper full image
abstract click to expand
The design of a system and its implementation are two tasks often carried out by different individuals on a development team, and can occur weeks or months apart. This creates a potential for divergence between real behavior and the designed model that an implementation is intended to match. Particularly as time passes and individuals who were present for the original conception of the design leave, a system can lose coherence and drift from intended design principles. Even with a robust system design, more is needed to ensure that the key implementation details match the design and that adherence to a particular strategy is not lost over time. This paper proposes an approach to address that concern for distributed systems using conformance checking, a methodology borrowed from process mining. Distributed traces produced by instrumented applications are evaluated for conformance by comparison to design traces. The resulting conformance percentage is a quantitative metric that can be tracked over time to determine how closely a concrete implementation corresponds to the key attributes of the expected design model. This analysis is done using the dominant industry standard, OpenTelemetry, and so should apply to a wide range of distributed systems.
0
0
cs.SE 2026-05-11 Recognition
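
A rough way to picture the conformance percentage: flatten both the design trace and the observed distributed trace into parent-child call edges and measure how many design edges actually appear at runtime. Real conformance checking aligns full traces rather than edge sets; this set-based sketch and the (service, operation) span naming are simplifying assumptions.

```python
# Sketch: conformance as the share of designed call edges observed in execution traces.
# Edges are (parent_span, child_span) pairs named "service.operation"; an assumption.

def conformance_percentage(design_edges, observed_edges):
    design_edges = set(design_edges)
    matched = design_edges & set(observed_edges)
    return 100.0 * len(matched) / len(design_edges) if design_edges else 100.0

design = {("gateway.route", "orders.create"),
          ("orders.create", "payments.charge"),
          ("orders.create", "inventory.reserve")}
observed = {("gateway.route", "orders.create"),
            ("orders.create", "payments.charge"),
            ("orders.create", "audit.log")}      # extra observed edge is ignored here
print(f"conformance: {conformance_percentage(design, observed):.1f}%")  # 66.7%
```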

One rules engine powers play, simulation, and tests

Mazocarta: A Seeded Procedural Deckbuilder for Instrumented Game Development

Mazocarta keeps interactive sessions, simulations, and automated checks on a single deterministic core to avoid drift.

Figure from the paper full image
abstract click to expand
Mazocarta is a seeded procedural tactical deckbuilder implemented in Rust, compiled to WebAssembly for browser play, and executable natively for simulation. Its primary technical contribution is not the invention of a new deckbuilding genre, but the construction of an instrumented game-development reference artifact: the same rules engine supports interactive play, native command-line simulation, automated end-to-end tests, save/load fixtures, and local-area multiplayer. This paper describes Mazocarta's architecture, deterministic run model, reproducible balance probes, and QR-mediated WebRTC pairing for local multiplayer. An evaluation snapshot over 1,000 deterministic seeds shows that the simulation pipeline can produce reproducible development signals. In the evaluated configuration, single-player and two-player autoplay win rates were 36.1% and 34.9% over 1,000 deterministic seeds, respectively. These rates are not presented as final player-facing balance metrics, but as repeatable probes for future balance shifts and regressions. Mazocarta is positioned as a playable open-source reference artifact for instrumented game development: deterministic regression checks, automated playtesting workflows, balance probes for game mechanics, and browser-native local multiplayer all exercise one shared production rules core.
0
0
cs.SE 2026-05-11 Recognition

Bidirectional analysis finds 118 unsafe flows in 87 MCP servers

Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

Protocol-specific static analysis reveals recurring data-flow risks missed by existing tools across 15k real-world repositories.

Figure from the paper full image
abstract click to expand
Model Context Protocol (MCP) servers have quickly become the interface layer between LLM agents and external tools, yet they also introduce unsafe data flows that existing analyzers handle poorly. Vulnerabilities manifest in two directions: requester-controlled arguments may propagate to sensitive operations, while untrusted external or sensitive internal data may surface through MCP-visible outputs and subsequently influence host or model behavior. Accurate detection is complicated by the heterogeneous registration and dispatch patterns MCP servers employ, the need for MCP-specific taint semantics, and the fact that bugs often only materialize along complete tool-scoped execution paths. We present MCP-BiFlow, a bidirectional static analysis framework built around MCP-aware entrypoint recovery, protocol-specific taint modeling, and interprocedural propagation analysis. Against a benchmark of 32 confirmed MCP vulnerability cases, MCP-BiFlow identifies 30 (93.8% recall), substantially outperforming CodeQL, Semgrep, Snyk Code, and MCPScan. Across 15,452 real-world MCP server repositories, MCP-BiFlow surfaces 549 overlap-compressed candidate clusters; manual review confirms 118 vulnerability paths in 87 servers, establishing unsafe propagation as a recurring failure mode that resists detection without protocol-aware recovery of both request-side and return-side flows.
0
0
cs.SE 2026-05-11 Recognition

Unified AST labels and graph matching link equivalent code across languages

Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

Functionally equivalent snippets from different languages are placed near each other in vector space, raising clone-detection F1 to 99.93%

Figure from the paper full image
abstract click to expand
The lexical and syntactic disparities among different programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval, since queries or code snippets written in one programming language often fail to match equivalent artifacts in another. To bridge this gap between different programming languages, we proposed a novel approach to construct a multi-language shared semantic space, in which functionally equivalent source code written in different programming languages are close to each other. In this approach, we first map the Abstract Syntax Tree (AST) node labels of the code snippets written in different programming languages into a unified label set, thus compressing high-dimensional language-specific tokens into a common embedding space. Then, we employ a Graph Matching Network (GMN) to encode the paired AST graphs into "semantic vectors" that capture functional equivalence between programming languages in a unified code vector space. In such a way, we can eliminate the differences in syntax between different programming languages. To validate the effectiveness of this approach, we apply it to two downstream tasks, including cross-language clone detection and cross-language code retrieval. Experiments demonstrate that our approach substantially outperforms the state-of-the-art baselines in cross-language clone detection, improving Precision from 95.62% to 99.94%, Recall from 97.72% to 99.92%, and F1 score from 96.94% to 99.93%. In terms of cross-language code retrieval, our approach raises the average Mean Reciprocal Rank (MRR) from 0.4909 to 0.5547, showing an absolute gain of 0.0638 (13% relative improvement), which demonstrates its superior ability to rank correct code snippets high across multiple programming languages.
0
0
cs.SE 2026-05-11 Recognition
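
The first step described, mapping language-specific AST node labels into one shared label set, can be pictured as a simple relabeling table applied before any graph matching. The tiny mapping below is invented for illustration; the paper's actual unified label set and the Graph Matching Network stage are not reproduced here.

```python
# Sketch: unify language-specific AST node labels before graph comparison.
# The label table is a made-up fragment, not the paper's mapping.

UNIFIED_LABELS = {
    # Java-style labels                # Python-style labels
    "MethodDeclaration": "FUNC_DEF",   "FunctionDef": "FUNC_DEF",
    "IfStatement": "IF",               "If": "IF",
    "ForStatement": "LOOP",            "For": "LOOP", "While": "LOOP",
    "ReturnStatement": "RETURN",       "Return": "RETURN",
}

def unify(ast_labels):
    return [UNIFIED_LABELS.get(label, "OTHER") for label in ast_labels]

java_ast = ["MethodDeclaration", "ForStatement", "IfStatement", "ReturnStatement"]
python_ast = ["FunctionDef", "For", "If", "Return"]
print(unify(java_ast) == unify(python_ast))   # True: same unified skeleton
```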

Agents patch code on 35-65% of already-fixed bugs

Coding Agents Don't Know When to Act

FixedBench tests show models default to changes even when none are needed, highlighting an action bias in current training.

Figure from the paper full image
abstract click to expand
Coding agents are increasingly deployed to autonomously maintain software, including to resolve user-reported issues: a bug report comes in and the agent creates a patch to address it. However, in any real-world deployment, they will encounter stale bug reports about issues that have already been resolved. Agents should recognize this and abstain from modifying the code to avoid accumulating technical debt. To systematically evaluate whether current agents do so, we introduce FixedBench, a code benchmark with 200 human-verified coding tasks in which no code changes are required, testing five recent models across four agent harnesses. We find that even state-of-the-art models fail, proposing undesirable changes (excluding tests and documentation) in 35% to 65% of cases. Explicit instructions to reproduce the issue before patching partially address this issue but introduce a new failure mode: when an issue is partially fixed, they abstain even though a patch would still be needed. More broadly, our results indicate that LLMs fall prey to an action bias: they choose to act even if inaction would be appropriate. To break this pattern, inaction needs to be explicitly framed as a path to success, which highlights an overreliance on human guidance implicit in current training objectives.
0
0
cs.SE 2026-05-11 Recognition

Neuro-symbolic method detects threats in stripped industrial binaries

Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

Lifting semantics into knowledge graphs improves CVE coverage and cuts false positives on real ICS hardware from multiple vendors.

Figure from the paper full image
abstract click to expand
Automated vulnerability detection in critical-infrastructure software confronts a fundamental barrier: industrial software is routinely deployed as stripped, symbol-free binaries that deprive conventional Software Composition Analysis of the source-level transparency it requires. Existing binary analysis techniques close this Semantic Gap only partially -- graph-based detectors preserve structural syntax but discard behavioral semantics, while large language models supply rich semantic cues at the cost of unstable, hallucination-prone inference. To address this gap, we present a semantic-enhanced neuro-symbolic framework that reconstructs behavioral semantics directly from opaque binaries and performs tractable global risk reasoning. Three tightly coupled mechanisms drive this capability: (1) abstract interpretation combined with a reflexive prompting pipeline that structurally constrains a local LLM agent, effectively suppressing hallucinations; (2) a surjective transformation that compresses raw Code Property Graphs into typed Software Supply Chain Knowledge Graphs amenable to scalable reasoning; and (3) a domain-adapted Graphormer that captures long-range vulnerability propagation, augmented by embedding-space subgraph matching to uncover zero-day and APT-style attack patterns. Evaluated across three benchmarks of increasing domain specificity, the framework consistently outperforms all baselines on detection accuracy, semantic lifting fidelity, and APT fingerprint matching. Deployment on a hybrid virtual-physical testbed incorporating production-grade hardware from five ICS vendors further confirms strong detection coverage of high-impact CVEs while substantially reducing false-positive rates relative to leading commercial tools.
0
0
cs.SE 2026-05-11 2 theorems

SARC enforces agent constraints at runtime for zero hard violations

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

Runtime checks cut soft overages 89.5% versus policy-as-code and extend to multi-agent trace trees

Figure from the paper full image
abstract click to expand
Agentic AI systems increasingly act through tools, sub-agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post-hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool-using agents that treats constraints as first-class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification-trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi-agent workflows through constraint propagation, authority intersection, and attribution-preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post-hoc audit, output filtering, workflow rules, and policy-as-code-only baselines on a procurement task. SARC executes zero hard-constraint violations under exact predicates; its declared PAA throttling response reduces soft-window overages by 89.5% relative to policy-as-code-only. Predicate-noise and enforcement-failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement-stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.
0
0
cs.SE 2026-05-11 2 theorems
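
The Pre-Action Gate is the easiest of the four enforcement sites to sketch: before an action executes, every applicable constraint predicate is evaluated, hard violations block the action, and soft violations pass through with their declared response such as throttling. The constraint fields and the procurement-flavored example are illustrative assumptions, not the SARC specification format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    klass: str                         # "hard" or "soft"
    predicate: Callable[[dict], bool]  # True means the proposed action is compliant
    response: str                      # declared response protocol for soft violations

def pre_action_gate(action, constraints):
    for c in constraints:
        if not c.predicate(action):
            if c.klass == "hard":
                return False, f"blocked by hard constraint: {c.name}"
            return True, f"allowed with response '{c.response}' ({c.name})"
    return True, "allowed"

constraints = [
    Constraint("no-unapproved-vendor", "hard",
               lambda a: a["vendor"] in {"acme", "globex"}, "escalate"),
    Constraint("spend-under-soft-cap", "soft",
               lambda a: a["amount"] <= 10_000, "throttle"),
]
print(pre_action_gate({"vendor": "acme", "amount": 25_000}, constraints))
```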

Manifesto puts AI at core of large-scale agile development

The AI-Native Large-Scale Agile Software Development Manifesto

Six principles aim to replace meetings and documents with parallel, intent-driven, AI-orchestrated processes.

Figure from the paper full image
abstract click to expand
Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles, parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints, that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.
0
0
cs.SE 2026-05-11 Recognition

Search tunes LLMs to cut harmful responses

SafeTune: Search-based Harmfulness Minimisation for Large Language Models

Hyperparameter and prompt search on Qwen3.5 0.8B yields large drops in harm and gains in relevance.

Figure from the paper full image
abstract click to expand
The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.
0
0
cs.SE 2026-05-11 Recognition

RAG with LLMs catches 91 percent of false kernel bug reports

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

Study of 2,006 reports finds false positives consume effort comparable to real bugs, concentrated in file systems and drivers.

Figure from the paper full image
abstract click to expand
False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.
0
0
cs.SE 2026-05-11 Recognition

Natural-language rewrite lifts code retrieval scores

Do not copy and paste! Rewriting strategies for code retrieval

Full transcription of queries and snippets adds up to 0.51 NDCG@10; corpus-only changes degrade results in 62% of cases; an entropy-shift diagnostic flags when rewriting pays off.

Figure from the paper full image
abstract click to expand
Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations, about 62%. We introduce two diagnostics, Delta H (token entropy) and Delta s (embedding cosine), and show that Delta H predicts retrieval gain under QC across all three rewriter families: pooled Spearman rho = +0.436, p < 0.001 on DeepSeek+Codestral; rho = +0.593 on Codestral alone; rho = +0.356 on Qwen. This establishes Delta H as a cheap, rewriter-agnostic proxy for deciding when rewriting pays off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
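A minimal sketch of a token-entropy shift diagnostic, under the assumption that Delta H is the change in Shannon entropy of the empirical token distribution between the original and rewritten text; whitespace tokenization stands in for the paper's tokenizer and the exact definition may differ.

```python
# Sketch of Delta H = H(rewritten) - H(original), with H the Shannon entropy of
# the empirical token distribution. Whitespace tokenization is a stand-in.
import math
from collections import Counter

def token_entropy(text):
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_h(original, rewritten):
    return token_entropy(rewritten) - token_entropy(original)

# Per the paper, a larger entropy shift under joint query-corpus rewriting
# tends to predict a larger retrieval gain.
print(delta_h("def add(a,b): return a+b", "Add two numbers and return their sum."))
```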
0
0
cs.SE 2026-05-11 Recognition

Scenario models automate VR app tests and catch more failures

System Test Generation for Virtual Reality Applications using Scenario Models

UltraInstinctVR generates and runs system tests on ten open-source applications, beating prior automated tools in coverage and unique bug flagging.

Figure from the paper full image
abstract click to expand
Virtual Reality (VR) applications are increasingly being integrated across a wide range of domains, including surgical training and industrial marketing. However, the long-term adoption and maintenance of VR applications remain limited, particularly due to the lack of effective, systematic, and reproducible software testing approaches tailored to their unique characteristics. To address this issue, we introduce UltraInstinctVR, a novel testing approach for VR applications. Relying on predefined VR models (scenarios), it automates the generation and execution of concrete VR system tests. In our empirical evaluation, we compare UltraInstinctVR with state-of-the-art automated VR testing approaches in terms of coverage and failure detection on 10 open-source VR applications. The results show that UltraInstinctVR outperforms existing automated tools for detecting unique failures and provides valuable insights for identifying real-world bugs in VR applications.
0
0
cs.SE 2026-05-11 Recognition

Iterative checks boost LLM quantum solver success

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Stronger models still produce numerically inaccurate outputs instead of crashes, limiting direct use for science.

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. The methodology adopts an iterative approach by executing the script generated by the LLM, comparing the result with the result of a classical solver, and refining the script until the two results match within a tolerance threshold. We empirically evaluated the methodology with five families of scientific problems of different complexities and five LLMs, both open source and proprietary. The results show that iterative refinement substantially improves success rates, but introduces a significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.
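The refine-until-tolerance loop in the abstract can be sketched directly; generate_solver, run_script, and the scalar comparison below are illustrative stand-ins, not the Q-SAGE implementation.

```python
# Sketch of iterative refinement: generate a quantum solver, execute it, compare
# against a classical reference, and feed the discrepancy back until the results
# agree within a tolerance. generate_solver and run_script are hypothetical.
def refine(problem, generate_solver, run_script, classical_result, tol=1e-3, max_iters=5):
    feedback = ""
    for i in range(max_iters):
        script = generate_solver(problem, feedback)   # LLM writes or repairs the solver
        try:
            result = run_script(script)               # execute the generated code
        except Exception as exc:                      # execution errors become feedback
            feedback = f"Execution failed: {exc}"
            continue
        error = abs(result - classical_result)
        if error <= tol:                              # numerically matches the reference
            return script, result, i + 1
        feedback = (f"Result {result} differs from the classical reference "
                    f"{classical_result} by {error}; revise the approximation.")
    return None, None, max_iters
```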
0
0
cs.SE 2026-05-11 Recognition

Prefill signals from small models locate multi-agent failures

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

MASPrism ranks sources of failure in long traces by extracting negative log-likelihood and attention weights before any output is generated.

Figure from the paper full image
abstract click to expand
Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69× speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.
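Per-token negative log-likelihood from a single prefill pass (no decoding) can be extracted with standard Hugging Face calls; the sketch below is illustrative and omits the attention-weight signal and the second diagnostic pass described in the abstract.

```python
# Sketch of prefill-stage signal extraction: one forward pass over the trace,
# zero output tokens, yielding per-token negative log-likelihood. The model
# choice mirrors the abstract but the downstream scoring is not MASPrism's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prefill_nll(trace_text, model_name="Qwen/Qwen3-0.6B"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tok(trace_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                        # prefill pass only, no decoding
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # NLL of each observed token
    # High-NLL spans are candidates for symptom-like or surprising steps.
    return tok.convert_ids_to_tokens(targets[0].tolist()), nll[0]
```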
0
0
cs.SE 2026-05-11 Recognition

Multi-shot prompts boost agreement only for Claude Haiku

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Ten-run evaluation shows 0.034 kappa gain only for Claude Haiku; all models over-predict negative feedback while Gemini shows the highest variance.

Figure from the paper full image
abstract click to expand
Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
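The agreement metric and significance test named in the abstract are standard; the sketch below uses scikit-learn and SciPy with toy labels and toy per-run kappas, not the study's data.

```python
# Sketch of the evaluation: Cohen's kappa between human and model codes per run,
# then a Wilcoxon signed-rank test comparing zero-shot and multi-shot kappas
# across ten runs. All values here are toy data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon

human = ["Sharing Negative Feedback", "Expressing Concerns", "Sharing Negative Feedback",
         "Other", "Expressing Concerns"]                      # human closed codes (toy)
model_run = ["Sharing Negative Feedback", "Sharing Negative Feedback",
             "Sharing Negative Feedback", "Other", "Expressing Concerns"]  # one model run (toy)
print("kappa:", cohen_kappa_score(human, model_run))

zero_shot_kappas = [0.41, 0.43, 0.40, 0.42, 0.44, 0.41, 0.43, 0.42, 0.40, 0.44]
multi_shot_kappas = [0.45, 0.46, 0.44, 0.45, 0.47, 0.44, 0.46, 0.45, 0.43, 0.48]
stat, p = wilcoxon(zero_shot_kappas, multi_shot_kappas)
print("Wilcoxon p-value:", p)   # the paper reports p = 0.004 for Claude Haiku
```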
0
0
cs.SE 2026-05-11 Recognition

Multi-stage training boosts Java-to-Cangjie code translation 6%

Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

Syntactic datasets and iterative repair overcome scarce parallel examples for better semantic alignment.

Figure from the paper full image
abstract click to expand
With the rapid evolution of emerging programming language ecosystems, the demand for code translation to low-resource languages continues to grow. As Cangjie emerges as a new programming language, its ecosystem and development toolchains are rapidly expanding. Automated translation from popular programming languages to Cangjie is therefore valuable for practical development. However, constrained by both insufficient Cangjie knowledge and scarce parallel code corpora, general Large Language Models (LLMs) are prone to syntactic errors and to semantic and structural misalignment in code translation. Existing approaches typically rely on fine-tuning with large-scale parallel data, but they cannot reliably improve compilability or semantic consistency for the low-resource Cangjie language. To tackle these challenges, we propose a multi-stage LLM training framework that employs iterative error repair to translate Java code into Cangjie code. The framework trains the LLM in stages, gradually integrating knowledge and achieving semantic alignment as well as structure awareness. During code translation, we combine compiler feedback with error-repair case retrieval to repair incorrect Cangjie code. We construct syntactic knowledge and monolingual instruction datasets to train the LLM, and build a Cangjie error-repair repository to support repair in our approach. Experimental results show that, with limited parallel data, our approach improves functional equivalence by 6.06% compared to state-of-the-art approaches. Meanwhile, ablation studies confirm that each training stage positively contributes to the final performance.
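The compile-and-repair part of the approach can be sketched as a feedback loop; translate, compile_cangjie, and retrieve_repair_cases are hypothetical stand-ins, and the multi-stage fine-tuning is not shown.

```python
# Sketch of compiler-feedback repair for translated code. The callables are
# hypothetical: translate() wraps the trained LLM, compile_cangjie() invokes the
# Cangjie compiler, retrieve_repair_cases() searches a repair-case repository.
def translate_with_repair(java_code, translate, compile_cangjie, retrieve_repair_cases,
                          max_rounds=3):
    cangjie = translate(java_code, hints=None)
    for _ in range(max_rounds):
        ok, errors = compile_cangjie(cangjie)     # compiler feedback on the candidate
        if ok:
            return cangjie
        cases = retrieve_repair_cases(errors)     # similar past error -> fix examples
        cangjie = translate(java_code, hints={"errors": errors, "repair_cases": cases})
    return cangjie                                # best effort after the repair budget
```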
0
0
cs.SE 2026-05-11 Recognition

Unclear roles top ML team challenges in semiconductors

Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry

Interviews in one global company reveal 16 collaboration issues amplified by hardware constraints, with unclear roles and responsibilities as the top challenge.

Figure from the paper full image
abstract click to expand
The integration of machine learning (ML) into complex software systems has increased collaboration and communication (CoCo) challenges for the teams building these systems. ML engineering (MLE) teams often involve diverse roles (ML engineers, data scientists, software engineers, and domain experts), each bringing unique goals, experiences, and jargon. These interdisciplinary dynamics can make it challenging to deploy, reproduce, and maintain ML-enabled systems over the long term. Previous studies have uncovered several CoCo challenges and practices, but most have focused on software-centric companies, leaving limited empirical understanding of how these dynamics unfold in hardware-centric contexts. In hardware-centric environments, CoCo challenges are shaped by additional constraints such as strict data governance, long development cycles, and tight coupling with physical processes, which amplify coordination complexity and reduce flexibility. To strengthen empirical understanding in such settings, we present a qualitative investigation of MLE teams within a global semiconductor company, where ML-enabled systems and manufacturing processes introduce additional complexity. We interviewed 12 practitioners regarding CoCo practices, tools, challenges, and approaches. Through analysis, we identified 16 recurring challenges, with unclear roles and responsibilities emerging as the most critical, along with common practices and recommendations that practitioners considered effective in mitigating CoCo problems. While grounded in a single organizational context, our findings align with known issues in interdisciplinary ML-enabled systems development, but also demonstrate how these challenges manifest differently under hardware-driven constraints. Our results highlight directions for future research and tool support to strengthen CoCo in MLE projects and ensure the success of ML-enabled systems.
0
0
cs.SE 2026-05-11 2 theorems

Open-source low-code editor builds and deploys AI web apps

Low-code and no-code with BESSER to create and deploy smart web applications

BESSER models smart applications at a high level and generates runnable code, avoiding the restrictions of commercial platforms.

Figure from the paper full image
abstract click to expand
The increasing demand for web applications containing AI-agents, seen as smart web applications, has prompted the need for new techniques to facilitate their creation. Low-code has risen as an approach that reduces the amount of handwritten code by focusing on the abstraction of components in the form of models combined with automated generators to produce applications. Existing low-code platforms are commercial, leading to drawbacks such as the risk of vendor lock-in, limited extensibility, and more. We present the open-source BESSER low-code framework, which allows users to design, generate and deploy their application via a freely accessible web-based editor, while guaranteeing transparency and extensibility.
0
0
cs.SE 2026-05-11 Recognition

AI backends gain one admission seam for governance across requests

Execution Envelopes: A Shared Admission Contract for Backend AI Execution Requests

An execution envelope records who asked for what and what was granted, letting shared behavior attach without rebuilding per service.

abstract click to expand
Enterprise AI backends increasingly admit heterogeneous execution requests across model deployment, inference, evaluation, data movement, and agentic workflows. In many systems, those requests arrive in service-specific shapes, which makes it difficult to attach shared admission-time behavior such as logging, governance hints, resource accounting, authorization-aware policy hooks, and later runtime review without rebuilding the same contract in each subsystem. This paper introduces the execution envelope, a normalized internal admission object that records who is asking for what kind of execution, what resources were requested, what policy-relevant scope accompanied the request, and what the backend ultimately granted. The proposal is intentionally narrow. It does not replace service-specific request models, perform scheduling, or introduce a new authority token. Instead, it defines a descriptive admission seam that can be threaded through real backend paths before backend-specific resolution begins. I formalize the distinction between requested and granted resources, specify the field families, invariants, and lifecycle of the envelope, work through POST /serving/deploy_model as an initial proving ground, and position the design relative to usage control, analyzable authorization, admission control, and cluster scheduling. The central claim is that a shared execution-admission contract is a useful missing primitive for modern AI backends because it creates one place to attach governance and observability without pretending to solve placement, policy, and runtime execution in a single step.
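One way to read the envelope's field families (requester, requested versus granted resources, policy scope, lifecycle) as a concrete record; the names and the single invariant below are illustrative, not the paper's schema.

```python
# Minimal sketch of an execution envelope as a normalized admission record.
# Field names, the lifecycle states, and the grant-never-exceeds-request
# invariant are illustrative readings of the field families described above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ResourceSpec:
    gpus: int = 0
    cpu_cores: int = 0
    memory_gb: int = 0

@dataclass
class ExecutionEnvelope:
    requester: str                       # who is asking
    execution_kind: str                  # e.g. "model_deploy", "inference", "evaluation"
    requested: ResourceSpec              # what was asked for
    granted: Optional[ResourceSpec] = None   # what admission ultimately allowed
    policy_scope: dict = field(default_factory=dict)  # authorization/governance hints
    status: str = "submitted"            # lifecycle sketch: submitted -> admitted/denied

    def admit(self, granted: ResourceSpec):
        # Invariant sketch: a grant never exceeds the request.
        assert granted.gpus <= self.requested.gpus
        self.granted, self.status = granted, "admitted"

env = ExecutionEnvelope(requester="svc-ml-platform", execution_kind="model_deploy",
                        requested=ResourceSpec(gpus=2, cpu_cores=16, memory_gb=64))
env.admit(ResourceSpec(gpus=1, cpu_cores=16, memory_gb=64))
```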
0
0
cs.SE 2026-05-11 2 theorems

Top LLM agents complete only 30-55% of code repositories from scratch

RepoZero: Can LLMs Generate a Code Repository from Scratch?

New benchmark turns generation into reproduction from API specs and shows current models fall short of full software development demands.

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30%-55%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
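Black-box validation via output equivalence reduces to running the reference and the regenerated repository on the same inputs and comparing observable behavior; the commands and inputs below are placeholders, not RepoZero's harness.

```python
# Sketch of output-equivalence checking: execute the original and the
# re-implemented repository on identical inputs and compare exit code and stdout.
import subprocess

def run(cmd, stdin_text, timeout=30):
    proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                          text=True, timeout=timeout)
    return proc.returncode, proc.stdout

def equivalent(reference_cmd, candidate_cmd, test_inputs):
    for stdin_text in test_inputs:
        if run(reference_cmd, stdin_text) != run(candidate_cmd, stdin_text):
            return False          # behavioral mismatch on this input
    return True

# e.g. equivalent(["python", "ref/cli.py"], ["node", "cand/cli.js"], ["1 2\n", "3 4\n"])
```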
0
0
cs.SE 2026-05-11 2 theorems

Authority transfer, not task performance, defines agentic CI/CD

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

Current systems limit agents to localized data-plane actions and rely on external governance for safety, creating an urgent need for control-plane safety and governance mechanisms.

abstract click to expand
AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human-agent coordination.
0
0
cs.SE 2026-05-08 Recognition

Replay script outperforms frontier models on static computer-use benchmarks

Computer Use at the Edge of the Statistical Precipice

Blind execution of recorded actions equals the source agent's pass@k rate on static deterministic tests.

abstract click to expand
Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent's pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments (privileged verification, realistic environments, integrity-checked configurations, sandboxed execution, and multifactorial variability) and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework pairing Wilson score intervals with hierarchical bootstrap, producing confidence intervals that correctly account for the nested structure of CUA benchmarks, as we empirically demonstrate. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.
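A sketch of why blind replay tracks pass@k, under the assumption that the replay stores, per task, one successful recorded trajectory whenever at least one of the source agent's k attempts succeeded; in a deterministic, static environment, re-execution reproduces that success.

```latex
% Sketch under the stated assumption: for task $t$, the replay's success
% indicator coincides with "at least one of the $k$ recorded attempts succeeded",
% because determinism makes re-execution of a successful action sequence succeed.
\[
  R_t \;=\; \mathbf{1}\!\left[\,\exists\, i \le k :\ \text{attempt } i \text{ on task } t \text{ succeeds}\,\right],
  \qquad
  \mathbb{E}_t\!\left[R_t\right] \;=\; \text{pass@}k .
\]
```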
0
0
cs.SE 2026-05-08 Recognition

LLM agents fix under half of architectural code smells

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

On scikit-learn the best agent resolves 47.7 percent while the most aggressive adds 140 new smells

Figure from the paper full image
abstract click to expand
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to κ = 0.94 expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
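One illustrative reading of the three separately scored quantities (repair effectiveness, false-positive identification, net codebase impact); the paper's exact formulas may differ, and the counts in the usage line are toy values loosely echoing the abstract.

```python
# Sketch of the three quantities the scoring methodology separates.
# The formulas are an illustrative reading, not SmellBench's exact scoring.
def score(smells_assigned, smells_resolved, fps_in_assigned, fps_flagged_correctly,
          new_smells_introduced):
    genuine = smells_assigned - fps_in_assigned
    repair_effectiveness = smells_resolved / max(genuine, 1)      # share of genuine smells fixed
    fp_identification = fps_flagged_correctly / max(fps_in_assigned, 1)
    net_impact = smells_resolved - new_smells_introduced          # can go negative
    return repair_effectiveness, fp_identification, net_impact

# An aggressive agent can resolve many smells yet still have negative net impact:
print(score(65, 20, 41, 35, 140))
```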
0
0
cs.SE 2026-05-08 2 theorems

Lack of belonging linked to higher burnout in software developers

Guidelines for Cultivating a Sense of Belonging to Reduce Developer Burnout

Guidelines synthesized from prior research on belongingness characteristics and factors to help reduce developer burnout in software organizations and open-source communities.

Figure from the paper full image
abstract click to expand
Burnout affects software developers' mental and physical well-being and contributes to turnover, generating strong concerns in the software industry. Prior research has shown that lack of belonging is associated with higher levels of burnout among software developers, while a sense of belonging is linked to resilience, job satisfaction, engagement, and well-being. In this paper, we revisit recent studies on belongingness in software development teams, including proprietary software organizations and open-source software communities, to offer evidence-based guidelines for cultivating belongingness and reducing developer burnout. We summarize characteristics of belongingness, such as trust, acceptance, value recognition, friendship, membership, mutual support, and being known by others, as well as factors associated with belongingness, including recognition, psychological safety, intrinsic motivation, English confidence, tenure, gender, and cultural power distance. Based on these findings, we propose practical guidelines for leaders and communities, including timely and consistent recognition, transparent promotion rules, inclusive benefits and initiatives, intentional connections through collaborative tools, blameless postmortems, optional in-person opportunities, informal newcomer gatherings, and continuous monitoring of belongingness and burnout. These guidelines can help software organizations and open-source communities foster healthier, more inclusive environments that support developer well-being.
0
