{"total":10,"items":[{"citing_arxiv_id":"2605.18747","ref_index":185,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ToolCoder [19] Function-oriented API search tools API selection via trigger prediction Grounds generation in retrieved APIs CodeQA [224] Function-oriented API/doc query tools Tool-augmented API QA Retrieves API evidence for coding RAG-for-Code [225] Function-oriented Repo, docs, API Retrieval-augmented context Knowledge for long-tail libraries CodeAgent [185] Environment-interaction Repo files, tests Repo navigation, editing, validation Repo-level coding via environment interaction SWE-agent [57] Environment-interaction Shell, editor, repo, tests Agent-computer interface loop Resolves GitHub issues via shell commands AgentCoder [50] Verification-driven Test generation Programmer-tester-executor loop Refines code via generated tests"},{"citing_arxiv_id":"2605.15222","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization","primary_cat":"cs.SE","submitted_at":"2026-05-13T08:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those 
involving parallelism and GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11388","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Reasoning in General Purpose Agents via Structured Meta-Cognition","primary_cat":"cs.CL","submitted_at":"2026-05-12T01:21:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13346","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentSPEX: An Agent SPecification and EXecution Language","primary_cat":"cs.CL","submitted_at":"2026-04-14T23:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11535","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems","primary_cat":"cs.AI","submitted_at":"2026-04-13T14:32:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three 
months.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05955","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution","primary_cat":"cs.SE","submitted_at":"2026-04-07T14:47:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"APIs rather than implement complex business logic or intricate functional interactions. To extend benchmarking toward more complex scenarios, subsequent benchmarks such as SWE-bench Pro [8] and SWE-Lancer [26] focus on enterprise-level software, where issue resolution involves longer horizons and greater difficulty. Complementing these efforts, SWE-bench Multimodal [37] augments the original benchmark with issues that include visual elements (e.g., bug screenshots), thereby evaluating models' ability to interpret and act on information presented across both textual and visual modalities. 
The evolution of these benchmarks reflects a clear shift toward evaluating LLM/agent capabilities in more realistic enterprise software maintenance practices."},{"citing_arxiv_id":"2603.22048","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dynamic analysis enhances issue resolution","primary_cat":"cs.SE","submitted_at":"2026-03-23T14:48:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.07900","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents","primary_cat":"cs.SE","submitted_at":"2026-02-08T10:26:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.18552","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Training Superintelligent Software Agents through Self-Play SWE-RL","primary_cat":"cs.SE","submitted_at":"2025-12-21T00:49:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.18436","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VeruSAGE: A Study of 
Agent-Based Verification for Rust Systems","primary_cat":"cs.OS","submitted_at":"2025-12-20T17:22:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}