arXiv: 2604.13107 · v1 · submitted 2026-04-10 · 💻 cs.SE · cs.AI · cs.LG

Can Coding Agents Be General Agents?

Maksim Ivanov, Abhijay Rana, Gokul Prabhakaran

Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords: coding agents, general agents, business process automation, ERP systems, domain logic, code execution, agent generalizability, task complexity

The pith

Coding agents complete simple business tasks but fail on complex ones because they cannot reliably bridge domain logic with code execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether coding agents that excel at software engineering can extend to general business process automation in an Enterprise Resource Planning system. A case study shows reliable success on straightforward tasks but repeated characteristic breakdowns on intricate multi-step processes. The authors identify the separation between understanding business rules and generating correct executable code as the central obstacle. Readers should care because closing this gap would determine whether coding agents can evolve into versatile tools for real-world operations rather than remaining specialized code generators.

Core claim

In evaluations of a coding agent on practical business tasks inside an open-core ERP system, the agent reliably completes simple tasks but exhibits characteristic failures on complex tasks. These failures arise when the agent must integrate domain-specific business logic with code execution, revealing a key bottleneck that prevents coding agents from generalizing to end-to-end business process automation.

What carries the argument

Case study evaluation of coding agent performance on simple versus complex tasks in an open-core ERP system, used to expose the domain-logic-to-code-execution gap.

If this is right

Agents will need explicit mechanisms to connect business rules to code before they can automate end-to-end processes.
Existing benchmarks miss the domain-logic integration problems that appear in real business systems.
Simple-task success does not predict performance on tasks requiring coordinated logic across modules.
General agent use cases will remain limited until the domain-code bridge is addressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar bottlenecks may appear in other non-coding domains where agents must apply specialized rules alongside code.
Hybrid designs that pair coding agents with separate reasoning modules for domain logic could be tested as a direct remedy.
Repeating the case study across different ERP platforms or agent architectures would clarify whether the pattern is widespread.

Load-bearing premise

The failures seen in this single open-core ERP case study are representative of coding agents in general rather than specific to the agent, the chosen tasks, or the system.

What would settle it

An experiment in which the same or similar coding agents successfully complete multiple complex multi-step business processes in the ERP system without the identified domain-logic failures would refute the bottleneck claim.

Figures

Figures reproduced from arXiv: 2604.13107 by Abhijay Rana, Gokul Prabhakaran, Maksim Ivanov.

**Figure 1.** The coding agent is an emergent agent architecture, which allows a model to create its own code tools at runtime and iterate with the help of code outputs and error logs. Experimentally, past research has shown that training models on code improves general reasoning. For instance, Aryabumi et al. find that adding code data to pre-training yields up to 8.2% relative improvement on natural language reasoning…

**Figure 2.** The critical evaluation gap: Code evaluations lack business context; business evaluations lack code interaction. 4.1 Setup. An ERP is the single source of truth for businesses’ operations, supporting more than 3.8 million companies worldwide and, by some estimates, underpinning 77% of the world’s transaction revenue. We identified the ERP as our ideal testing ground for “general” business tasks because it c…

**Figure 3.** An example of a process in an ERP, involving an order-to-delivery workflow. 4.2 Environment. We built a fictional company inside Odoo 19.0 Community Edition, the open-core ERP with over 12 million users across SMBs and large enterprises. Odoo is particularly well suited for this research: unlike most proprietary business apps, it is free to self-host, easy to spin up in many parallel sandboxed instances, an…

**Figure 4.** A visualization of the decisions present in a simplified scenario. The solution to this task requires multiple choices and execution steps within the ERP. 4.4 Agent Harness. We developed a simple coding agent harness for our trials. The harness includes only a bash tool, which is how the agent writes and executes code scripts to interact with the Odoo environment. We provide the agent with database credenti…
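The single-tool design described in this excerpt can be pictured with a short sketch. It is a hypothetical illustration, not the authors' harness: the function name `run_bash`, the `ToolResult` type, the timeout, and the output cap are all assumptions. It shows a bash tool that runs one command and hands stdout, stderr, and the exit code back to the model so that it can iterate on outputs and error logs.

```python
import subprocess
from dataclasses import dataclass

MAX_OUTPUT_CHARS = 4000  # assumed cap so long logs do not flood the model's context


@dataclass
class ToolResult:
    exit_code: int
    stdout: str
    stderr: str


def run_bash(command: str, timeout_s: float = 30.0) -> ToolResult:
    """Run one shell command and return what the agent would see on its next turn."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as an ordinary failed result instead of raising,
        # so the surrounding agent loop can decide how to recover.
        return ToolResult(exit_code=124, stdout="", stderr=f"timed out after {timeout_s}s")
    return ToolResult(
        exit_code=proc.returncode,
        stdout=proc.stdout[:MAX_OUTPUT_CHARS],
        stderr=proc.stderr[:MAX_OUTPUT_CHARS],
    )


if __name__ == "__main__":
    ok = run_bash("echo hello")
    print(ok.exit_code, ok.stdout.strip())
    bad = run_bash("ls /path/that/does/not/exist")
    print(bad.exit_code, bad.stderr.strip())
```

Returning failures as data rather than exceptions is a design choice in this sketch: it keeps every outcome, including errors, visible to the model as text it can react to.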

**Figure 5.** We test Claude Sonnet 4.5 and GPT-5 on 20 scenarios across a gradient of difficulty from Easy to Hardest. We observe that increasing the complexity of the scenarios by increasing the number of constraints results in a breakdown of performance. We separate our evaluation of each scenario into 4 dimensions: Constraint Resolution, Resource Optimization, Traceability, and Policy Adherence. Noticeably, GPT-5 te…

**Figure 6.** The agent hallucinates that damaged LED boards must be stored in the fridge, without any supporting evidence in the system, leading to functional-but-misleading code. … we observe that at the business level, feedback is sparse. The inbuilt ERP guardrails reject obviously invalid and ill-formatted interactions but won’t flag suboptimal choices or policy violations. Only when we run the task verifier do we see…

**Figure 7.** The agent disregards the policy requirement to only schedule consecutive days off, and incorrectly claims successful completion. 7 Conclusion. Coding agents represent a promising path toward general-purpose AI. They are already widely adopted, rapidly improving, and operate in the same medium through which most white-collar work already flows: software. After examining performance in our simulations, we fin…
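The policy violation in Figure 7 is the kind of error a task verifier can detect mechanically even when the ERP accepts the records. A minimal sketch of such a check follows. It is hypothetical, not the paper's verifier: the function name `days_off_are_consecutive` and the representation of a schedule as a collection of `datetime.date` values are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Iterable


def days_off_are_consecutive(days_off: Iterable[date]) -> bool:
    """Return True if the days off form one unbroken run of calendar days."""
    unique_days = sorted(set(days_off))
    if not unique_days:
        return True  # assumption: an empty schedule does not violate the policy
    # An unbroken run of n distinct days spans exactly n - 1 days end to end.
    return unique_days[-1] - unique_days[0] == timedelta(days=len(unique_days) - 1)


if __name__ == "__main__":
    block = [date(2026, 3, 2), date(2026, 3, 3), date(2026, 3, 4)]
    split = [date(2026, 3, 2), date(2026, 3, 4)]
    print(days_off_are_consecutive(block))
    print(days_off_are_consecutive(split))
```

A check of this shape compares the final data state against the stated policy, which is a signal the agent does not receive from the ERP's built-in guardrails.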

Original abstract

As coding agents have seen rapid capability and adoption gains, users are applying them to general tasks beyond software engineering. In this post, we investigate whether coding agents can successfully generalize to end-to-end business process automation. We identify gaps in current evaluations, and conduct a case study to evaluate a coding agent on practical business tasks in an open-core Enterprise Resource Planning system. We find that the agent reliably completes simple tasks but exhibits characteristic failures on complex tasks, suggesting that bridging domain logic and code execution is a key bottleneck to generalizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A single ERP case study flags a plausible domain-logic bottleneck for coding agents but lacks the controls or metrics needed to treat the failures as general.

The paper's core observation is that a coding agent handles straightforward tasks in an open-core ERP but runs into repeated problems on more involved business processes, which the authors tie to the difficulty of connecting domain rules with code execution. That framing is reasonable as a practical alert for anyone trying to move these agents into real enterprise workflows. What stands out as new is the choice of end-to-end ERP automation as the test bed; most agent papers stay inside software-engineering benchmarks, so this target is fresher even if the underlying idea of task-complexity limits is not. The write-up also usefully calls out gaps in how current evaluations are structured. Beyond that, the contribution is mainly observational.

The soft spots are the obvious ones for a case study of this kind. There are no task counts, success definitions, baseline comparisons, or ablation checks in the abstract, and the full text does not appear to add quantitative controls or multiple systems. Without those, the claim that the failures are characteristic of coding agents rather than specific to the chosen agent, task set, or ERP data model rests on an assumption of representativeness that is not tested. The evidence is therefore too thin to support strong generalization.

This is the sort of note that would interest people building or deploying agents for business logic, but it is still early-stage. A reader already following the agent-evaluation literature would pick up the ERP-specific failure modes and the call for better domain integration, yet would not treat the results as settled. I would send it to peer review so the authors can supply the missing methodological detail and perhaps run a small controlled follow-up; the topic is timely enough that referees could help shape it into something more usable.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether coding agents can generalize beyond software engineering to end-to-end business process automation. It presents a case study evaluating one coding agent on practical tasks in a single open-core ERP system and concludes that the agent reliably succeeds on simple tasks but shows characteristic failures on complex tasks, attributing this to a bottleneck in bridging domain logic with code execution.

Significance. An empirically grounded identification of integration bottlenecks between domain knowledge and executable code would be useful for guiding agent research toward more generalizable systems. The work provides an initial observational data point in the under-explored area of business-process automation, but the single-system design limits the strength of any generalizability claims.

major comments (2)

[Case study and results sections] The inference that observed failures are 'characteristic' and indicate a general 'bridging domain logic and code execution' bottleneck for coding agents rests on a single open-core ERP case study. No controls (different agents, different domains, or ablation of the domain-logic component) are reported, so the attribution cannot be distinguished from artifacts of the chosen agent, task set, or ERP data model/API surface. This directly undermines the central claim in the abstract and discussion.
[Evaluation methodology and results] No task counts, success criteria, quantitative metrics, or statistical controls appear in the reported evaluation. Without these, it is impossible to determine whether the 'reliable' success on simple tasks or the 'characteristic failures' on complex tasks are robust or merely anecdotal.

minor comments (2)

[Abstract] The abstract would be strengthened by briefly stating the number of tasks, the definition of success, and the agent/system versions used.
[Case study description] Clarify how tasks were selected and categorized as 'simple' versus 'complex' to allow readers to assess potential selection bias.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. The work is presented as an initial case study exploring the generalization of coding agents to business process automation, and we appreciate the opportunity to clarify its scope and strengthen the reporting of results. We respond point-by-point to the major comments below.

Referee: [Case study and results sections] The inference that observed failures are 'characteristic' and indicate a general 'bridging domain logic and code execution' bottleneck for coding agents rests on a single open-core ERP case study. No controls (different agents, different domains, or ablation of the domain-logic component) are reported, so the attribution cannot be distinguished from artifacts of the chosen agent, task set, or ERP data model/API surface. This directly undermines the central claim in the abstract and discussion.

Authors: We agree that the study is limited to a single coding agent and one open-core ERP system, and we do not claim to have established a general bottleneck applicable to all coding agents. The manuscript is explicitly framed as a case study whose goal is to surface observable patterns in agent behavior on practical business tasks. The phrase 'characteristic failures' is used to describe recurring failure modes (e.g., inability to correctly map domain concepts to executable operations) that appeared consistently across the complex tasks we examined; these are documented with concrete examples in the results. We will revise the abstract, discussion, and a new limitations subsection to qualify the language, replacing stronger claims with statements such as 'observations from this case study suggest that bridging domain logic and code execution may be a bottleneck' and explicitly noting the absence of cross-agent or cross-domain controls. This revision will be made without altering the core empirical observations.

revision: partial

Referee: [Evaluation methodology and results] No task counts, success criteria, quantitative metrics, or statistical controls appear in the reported evaluation. Without these, it is impossible to determine whether the 'reliable' success on simple tasks or the 'characteristic failures' on complex tasks are robust or merely anecdotal.

Authors: The evaluation section currently emphasizes qualitative description of task outcomes and failure modes rather than aggregate statistics. We accept that this presentation makes it difficult for readers to gauge the scale and consistency of the results. In the revised manuscript we will add an explicit 'Evaluation Setup' subsection that states: (1) the total number of tasks attempted, (2) the breakdown into simple versus complex categories, (3) the precise success criteria applied (completion of the end-to-end business process with correct data state and no manual intervention required), and (4) the observed success counts per category. These counts will be presented in a table. Because the study was designed as an observational case study rather than a controlled experiment, we did not pre-specify statistical tests; we will note this limitation while still providing the raw counts so readers can form their own assessment of robustness.

revision: yes
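When only raw success counts per category are available, one common way for a reader to judge how far a small sample supports a word like 'reliable' is a Wilson score interval for a binomial proportion. The sketch below is an illustration only: the function name `wilson_interval` and the example counts are assumptions, not figures from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not the paper's results.
    for label, k, n in [("simple", 5, 5), ("complex", 1, 5)]:
        low, high = wilson_interval(k, n)
        print(f"{label}: {k}/{n} succeeded, interval [{low:.2f}, {high:.2f}]")
```

With five trials, even a perfect record leaves the lower bound of the roughly 95% interval near 0.57, which is why small task sets support only cautious claims in either direction.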

Circularity Check

0 steps flagged

No circularity: purely observational case study with no derivations or fitted predictions

The paper reports results from a single open-core ERP case study, observing that a coding agent succeeds on simple tasks but shows failures on complex ones. It offers an interpretive suggestion that bridging domain logic and code execution is a bottleneck, but this is not derived from any equations, parameters, or formal chain that reduces to inputs by construction. There are no mathematical predictions, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The work is self-contained as an empirical report; its claims rest on direct task outcomes rather than any internal reduction. This matches the default non-circular outcome for observational studies without derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical content; the work rests on the standard assumption that task complexity can be meaningfully categorized as simple versus complex and that one ERP system is informative about general business automation.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 5 internal anchors

[1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333, 2021.

[2] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[3] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

[4] OpenAI-Research. Introducing SWE-bench Verified. OpenAI Blog, 2024.

[5] TerminalBench-Team. Terminal bench. TerminalBench Leaderboard, 2024.

[6] Anthropic-Research. @anthropic-ai/claude-code. npm package, 2025.

[7] Peter Steinberger. Claude Code is my computer. Blog post, 2025.

[8] Anthropic-Research. Building agents with the Claude Agent SDK. Anthropic Engineering Blog, 2024.

[9] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

[10] Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Ustun, and Sara Hooker. To code, or not to code? Exploring impact of code in pre-training. arXiv preprint arXiv:2408.10914, 2024.

[11] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030, 2024.

[12] OpenAI-Research. GPT-5.1 Codex Max. OpenAI Blog, 2025.

[13] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[14] Gorilla-Team. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. UC Berkeley, 2024.

[15] Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, and Evan Hubinger. Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162, 2024.

[16] Richard S. Sutton. The bitter lesson. Essay, 2019.