arXiv preprint arXiv:2502.05352
5 Pith papers cite this work.
citing papers explorer
- Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks
  Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners: pairwise comparisons across 11 subsystems show effects of both signs, and leave-one-system-out selection incurs regret of up to 24.8 pp.
- LLMs Corrupt Your Documents When You Delegate
  LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.
- Runtime-Structured Task Decomposition for Agentic Coding Systems
  On two software engineering workloads, runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only the failed subtasks.
- From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
  The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.
- Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery
  The paper introduces Experiment-as-Code Labs, a declarative stack that combines AI agents, systems orchestration, and physical lab control for AI-driven discovery.