PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
hub
arXiv: 2511.13646 [cs.SE].URL:https://arxiv.org/abs/2511.13646
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
citing papers explorer
-
VeruSAGE: A Study of Agent-Based Verification for Rust Systems
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.