How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Pith reviewed 2026-05-08 11:34 UTC · model grok-4.3
The pith
AI coding agents consume roughly 1000x more tokens than code reasoning or chat tasks, with input tokens driving the cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a study of eight frontier LLMs on SWE-bench Verified, agentic coding tasks consume 1000x more tokens than code reasoning and code chat tasks, with input tokens driving the cost. Token usage varies up to 30x on the same task without corresponding accuracy gains; accuracy instead peaks at intermediate costs, and models differ substantially in efficiency. Human-rated task difficulty correlates only weakly with actual token use, and models underestimate their own token consumption, with correlations of at most 0.39.
What carries the argument
Measurement and comparison of token consumption trajectories across multiple model runs on standardized agentic coding benchmarks, distinguishing input versus output tokens and relating them to task outcomes.
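The measurement machinery above amounts to simple per-run bookkeeping. A minimal sketch, with an invented record schema (field names like `input_tokens` are illustrative, not the paper's trajectory format):

```python
from statistics import mean

# Hypothetical per-run token records for a single SWE-bench task.
# Values are made up to mirror the paper's qualitative findings.
runs = [
    {"input_tokens": 1_200_000, "output_tokens": 15_000, "resolved": True},
    {"input_tokens": 210_000, "output_tokens": 9_000, "resolved": True},
    {"input_tokens": 6_300_000, "output_tokens": 32_000, "resolved": False},
]

def summarize(runs):
    # Total tokens per run, the share of tokens carried by input,
    # and the max/min spread across runs (the paper reports up to 30x).
    totals = [r["input_tokens"] + r["output_tokens"] for r in runs]
    input_share = sum(r["input_tokens"] for r in runs) / sum(totals)
    return {
        "mean_total": mean(totals),
        "input_share": input_share,
        "spread": max(totals) / min(totals),
    }

print(summarize(runs))  # input_share > 0.99, spread near 29x on this toy data
```

The same aggregation, run per model and per task, is all that is needed to reproduce the input-versus-output split and the variability claims.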
If this is right
- Agent deployment in coding will incur much higher costs than expected from non-agent uses.
- Accuracy may not improve with higher token spend beyond a certain point.
- Some models offer better value by using fewer tokens for similar performance.
- Pre-task token prediction by agents needs improvement to avoid surprises.
- Relying on human estimates of difficulty for cost budgeting is unreliable.
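The first implication can be made concrete with back-of-envelope cost arithmetic. The per-million-token prices below are placeholders, not any provider's actual pricing:

```python
# Placeholder prices per million tokens; real provider pricing differs
# and changes often, so treat these numbers purely as assumptions.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run under the placeholder pricing above."""
    return (input_tokens / 1e6) * PRICE_PER_M["input"] \
         + (output_tokens / 1e6) * PRICE_PER_M["output"]

# An input-dominated agentic run vs. a single chat completion:
agent = run_cost_usd(2_000_000, 20_000)  # ~$6.30
chat = run_cost_usd(2_000, 500)          # ~$0.01
print(agent, chat, agent / chat)
```

Even with input tokens priced far below output tokens, the sheer input volume makes the agentic run hundreds of times more expensive than the chat call.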
Where Pith is reading between the lines
- This suggests a need for agent architectures that minimize redundant context or input.
- Efficiency optimizations could focus on reducing input token accumulation in multi-step reasoning.
- Similar studies on other domains like research agents might reveal domain-specific consumption patterns.
- If costs are this variable, runtime monitoring and adaptive strategies become important.
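The last point, runtime monitoring, can be sketched as a budget guard wrapped around the agent loop. All names and thresholds here are hypothetical:

```python
class TokenBudget:
    """Minimal runtime guard: degrade or halt an agent run as spend
    approaches a cap. A sketch only; a real scaffold would hook this
    into every model call."""

    def __init__(self, cap: int, warn_at: float = 0.8):
        self.cap = cap          # hard token ceiling for the run
        self.warn_at = warn_at  # fraction of cap that triggers degradation
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> str:
        self.used += input_tokens + output_tokens
        if self.used >= self.cap:
            return "stop"       # cap hit: halt the run
        if self.used >= self.warn_at * self.cap:
            return "degrade"    # e.g. switch models or prune context
        return "ok"

budget = TokenBudget(cap=1_000_000)
print(budget.record(600_000, 5_000))  # "ok"
print(budget.record(300_000, 5_000))  # "degrade" (>= 80% of cap)
```

Given the paper's 30x run-to-run variance, a guard like this matters precisely because pre-task prediction is unreliable.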
Load-bearing premise
The patterns of token consumption found in SWE-bench Verified agent trajectories apply to other real-world agentic coding tasks and the eight models represent typical frontier system behavior.
What would settle it
Finding that token consumption in agentic coding tasks is comparable to that in code reasoning or chat tasks, or that models accurately predict their token usage with high correlation, would challenge the central findings.
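The self-prediction test reduces to a rank correlation between predicted and actual token totals. A dependency-free Spearman sketch on invented data, chosen so the coefficient lands near the paper's 0.39 ceiling:

```python
def spearman(xs, ys):
    """Spearman rank correlation. Ties broken by position, which is
    fine for a sketch with distinct values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative predicted vs. actual totals (not the paper's data):
predicted = [100_000, 250_000, 400_000, 900_000, 1_200_000]
actual = [2_000_000, 300_000, 700_000, 900_000, 5_000_000]
print(spearman(predicted, actual))  # 0.4: weak, near the paper's ceiling
```

A result consistently near 1.0 on real trajectories would overturn finding (5); values in the 0.3 to 0.4 range reproduce it.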
Original abstract
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts the first systematic empirical study of token consumption patterns in agentic coding tasks. Analyzing trajectories from eight frontier LLMs on SWE-bench Verified, it claims that (1) agentic tasks are uniquely expensive, consuming ~1000x more tokens than code reasoning or code chat tasks with input tokens driving costs; (2) usage is highly stochastic (up to 30x variation on identical tasks) and accuracy peaks at intermediate rather than high costs; (3) models differ substantially in efficiency (e.g., Kimi-K2 and Claude-Sonnet-4.5 average >1.5M more tokens than GPT-5); (4) human expert difficulty ratings align only weakly with actual token costs; and (5) models fail to self-predict token usage (correlations ≤0.39) and systematically underestimate costs.
Significance. If the quantitative patterns hold and the comparisons are robust, the work supplies valuable data on the economics of agentic systems, model token efficiency differences, and the limits of cost prediction. It could inform agent design, deployment decisions, and research on cost-aware agents. The observation of a gap between human-perceived complexity and actual computational effort is a useful contribution to understanding agent behavior.
major comments (3)
- [Methods] Methods section: The abstract and results report precise quantitative claims (1000x multiplier, 30x variability on same tasks, 1.5M token differences, correlations up to 0.39) but provide no details on number of runs per task, statistical tests, controls for prompt variation, or exact token measurement procedures (e.g., inclusion of system prompts, tool outputs, or retries). These omissions are load-bearing for verifying the central empirical findings.
- [Results] Results (token consumption comparisons): The claim that agentic tasks are 'uniquely expensive' with a 1000x multiplier over code reasoning and code chat is central but rests on unspecified baseline tasks, models, scaffolds, and measurement protocols for the non-agentic conditions. Without these, it is unclear whether the multiplier and input dominance generalize or are specific to the chosen comparators.
- [Discussion] Discussion: The observed patterns (input-driven costs, high variability, poor self-prediction) are derived exclusively from SWE-bench Verified trajectories with particular agent scaffolds. The manuscript should provide evidence or explicit caveats on why these would hold for other real-world agentic coding workflows that may differ in task openness, interaction length, or tool usage.
minor comments (3)
- [Abstract] Abstract: Model names should be standardized and match the exact identifiers used in experiments (e.g., 'Claude-Sonnet-4.5').
- [Figures] Figures: Token usage plots should display full distributions or variability metrics (e.g., boxplots, percentiles) rather than averages alone to illustrate the reported 30x stochastic differences.
- [Related Work] Related work: Include additional citations to prior empirical studies on LLM token costs or efficiency in non-agentic settings to better situate the novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity, reproducibility, and scope. We address each major comment point by point below and have revised the manuscript accordingly.
Point-by-point responses
-
Referee: [Methods] Methods section: The abstract and results report precise quantitative claims (1000x multiplier, 30x variability on same tasks, 1.5M token differences, correlations up to 0.39) but provide no details on number of runs per task, statistical tests, controls for prompt variation, or exact token measurement procedures (e.g., inclusion of system prompts, tool outputs, or retries). These omissions are load-bearing for verifying the central empirical findings.
Authors: We agree that the Methods section lacked sufficient detail to support the reported quantitative claims. In the revised manuscript we have substantially expanded this section to specify the number of runs per task, the statistical tests performed (including how variability and comparisons were assessed), controls for prompt variation, and the precise token measurement protocol (covering system prompts, tool outputs, and handling of retries). These additions directly address the reproducibility concerns. revision: yes
-
Referee: [Results] Results (token consumption comparisons): The claim that agentic tasks are 'uniquely expensive' with a 1000x multiplier over code reasoning and code chat is central but rests on unspecified baseline tasks, models, scaffolds, and measurement protocols for the non-agentic conditions. Without these, it is unclear whether the multiplier and input dominance generalize or are specific to the chosen comparators.
Authors: We acknowledge that the non-agentic baselines were insufficiently described. We have revised the Results and Methods sections to explicitly define the baseline tasks, the models and scaffolds used for them, and the consistent measurement protocols applied across agentic and non-agentic conditions. This clarification supports the 1000x multiplier and input-token dominance within the experimental setup we employed. revision: yes
-
Referee: [Discussion] Discussion: The observed patterns (input-driven costs, high variability, poor self-prediction) are derived exclusively from SWE-bench Verified trajectories with particular agent scaffolds. The manuscript should provide evidence or explicit caveats on why these would hold for other real-world agentic coding workflows that may differ in task openness, interaction length, or tool usage.
Authors: We agree that the Discussion should address generalizability. We have added explicit caveats in the revised Discussion section acknowledging that the reported patterns come from SWE-bench Verified with the specific agent scaffolds used. We discuss why core observations (such as input-driven costs arising from accumulating context and tool calls) are likely to be relevant more broadly, while noting that differences in task openness, interaction length, and tool usage in other workflows may affect the magnitude of these effects. We also outline directions for future validation on additional agentic coding settings. revision: yes
Circularity Check
No significant circularity in this empirical observational study
full rationale
The paper reports direct measurements of token consumption on SWE-bench Verified trajectories across eight models, along with observed variability, accuracy correlations, model efficiency differences, and LLMs' self-prediction performance (correlations up to 0.39). No equations, derivations, or first-principles results reduce any reported quantity to a fitted parameter or input defined by the same data. The 'prediction' component evaluates frontier models' ability to forecast their own token use before execution, assessed externally rather than through author-side fitting that would force the outcome. No self-citations are load-bearing for the central claims, and the analysis is self-contained with respect to the benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SWE-bench Verified tasks and the agent scaffolds used produce representative token-usage statistics for real-world agentic coding.
Forward citations
Cited by 1 Pith paper
-
MEMOA: Massive Mixtures of Online Agents via Mean-Field Decentralized Nash Equilibria
Derives unique closed-form decentralized policy minimizing worst-agent online regret that asymptotically converges to centralized Nash-optimal policy in mean-field limit, with added online mixture weighting.
discussion (0)