How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Pith reviewed 2026-05-08 11:34 UTC · model grok-4.3
The pith
AI coding agents consume roughly 1000x more tokens than code reasoning or chat tasks, with input tokens driving the cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a study of eight frontier LLMs on SWE-bench Verified, agentic coding tasks consume 1000x more tokens than code reasoning and code chat tasks, with input tokens driving the cost. Token usage varies up to 30x on the same task without corresponding accuracy gains; accuracy instead peaks at intermediate costs, and models differ substantially in efficiency. Human-rated task difficulty correlates only weakly with actual token use, and models underestimate their own token consumption, with correlations of at most 0.39.
What carries the argument
Measurement and comparison of token consumption trajectories across multiple model runs on standardized agentic coding benchmarks, distinguishing input versus output tokens and relating them to task outcomes.
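The measurement machinery above amounts to simple per-run bookkeeping. A minimal sketch, with an invented record schema (field names like `input_tokens` are illustrative, not the paper's trajectory format):

```python
from statistics import mean

# Hypothetical per-run token records for a single SWE-bench task.
# Values are made up to mirror the paper's qualitative findings.
runs = [
    {"input_tokens": 1_200_000, "output_tokens": 15_000, "resolved": True},
    {"input_tokens": 210_000, "output_tokens": 9_000, "resolved": True},
    {"input_tokens": 6_300_000, "output_tokens": 32_000, "resolved": False},
]

def summarize(runs):
    # Total tokens per run, the share of tokens carried by input,
    # and the max/min spread across runs (the paper reports up to 30x).
    totals = [r["input_tokens"] + r["output_tokens"] for r in runs]
    input_share = sum(r["input_tokens"] for r in runs) / sum(totals)
    return {
        "mean_total": mean(totals),
        "input_share": input_share,
        "spread": max(totals) / min(totals),
    }

print(summarize(runs))  # input_share > 0.99, spread near 29x on this toy data
```

The same aggregation, run per model and per task, is all that is needed to reproduce the input-versus-output split and the variability claims.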
If this is right
- Agent deployment in coding will incur much higher costs than expected from non-agent uses.
- Accuracy may not improve with higher token spend beyond a certain point.
- Some models offer better value by using fewer tokens for similar performance.
- Pre-task token prediction by agents needs improvement to avoid surprises.
- Relying on human estimates of difficulty for cost budgeting is unreliable.
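The first implication can be made concrete with back-of-envelope cost arithmetic. The per-million-token prices below are placeholders, not any provider's actual pricing:

```python
# Placeholder prices per million tokens; real provider pricing differs
# and changes often, so treat these numbers purely as assumptions.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run under the placeholder pricing above."""
    return (input_tokens / 1e6) * PRICE_PER_M["input"] \
         + (output_tokens / 1e6) * PRICE_PER_M["output"]

# An input-dominated agentic run vs. a single chat completion:
agent = run_cost_usd(2_000_000, 20_000)  # ~$6.30
chat = run_cost_usd(2_000, 500)          # ~$0.01
print(agent, chat, agent / chat)
```

Even with input tokens priced far below output tokens, the sheer input volume makes the agentic run hundreds of times more expensive than the chat call.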
Where Pith is reading between the lines
- This suggests a need for agent architectures that minimize redundant context or input.
- Efficiency optimizations could focus on reducing input token accumulation in multi-step reasoning.
- Similar studies on other domains like research agents might reveal domain-specific consumption patterns.
- If costs are this variable, runtime monitoring and adaptive strategies become important.
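The last point, runtime monitoring, can be sketched as a budget guard wrapped around the agent loop. All names and thresholds here are hypothetical:

```python
class TokenBudget:
    """Minimal runtime guard: degrade or halt an agent run as spend
    approaches a cap. A sketch only; a real scaffold would hook this
    into every model call."""

    def __init__(self, cap: int, warn_at: float = 0.8):
        self.cap = cap          # hard token ceiling for the run
        self.warn_at = warn_at  # fraction of cap that triggers degradation
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> str:
        self.used += input_tokens + output_tokens
        if self.used >= self.cap:
            return "stop"       # cap hit: halt the run
        if self.used >= self.warn_at * self.cap:
            return "degrade"    # e.g. switch models or prune context
        return "ok"

budget = TokenBudget(cap=1_000_000)
print(budget.record(600_000, 5_000))  # "ok"
print(budget.record(300_000, 5_000))  # "degrade" (>= 80% of cap)
```

Given the paper's 30x run-to-run variance, a guard like this matters precisely because pre-task prediction is unreliable.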
Load-bearing premise
The patterns of token consumption found in SWE-bench Verified agent trajectories apply to other real-world agentic coding tasks and the eight models represent typical frontier system behavior.
What would settle it
Finding that token consumption in agentic coding tasks is comparable to that in code reasoning or chat tasks, or that models accurately predict their token usage with high correlation, would challenge the central findings.
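The self-prediction test reduces to a rank correlation between predicted and actual token totals. A dependency-free Spearman sketch on invented data, chosen so the coefficient lands near the paper's 0.39 ceiling:

```python
def spearman(xs, ys):
    """Spearman rank correlation. Ties broken by position, which is
    fine for a sketch with distinct values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative predicted vs. actual totals (not the paper's data):
predicted = [100_000, 250_000, 400_000, 900_000, 1_200_000]
actual = [2_000_000, 300_000, 700_000, 900_000, 5_000_000]
print(spearman(predicted, actual))  # 0.4: weak, near the paper's ceiling
```

A result consistently near 1.0 on real trajectories would overturn finding (5); values in the 0.3 to 0.4 range reproduce it.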
Original abstract
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts the first systematic empirical study of token consumption patterns in agentic coding tasks. Analyzing trajectories from eight frontier LLMs on SWE-bench Verified, it claims that (1) agentic tasks are uniquely expensive, consuming ~1000x more tokens than code reasoning or code chat tasks with input tokens driving costs; (2) usage is highly stochastic (up to 30x variation on identical tasks) and accuracy peaks at intermediate rather than high costs; (3) models differ substantially in efficiency (e.g., Kimi-K2 and Claude-Sonnet-4.5 average >1.5M more tokens than GPT-5); (4) human expert difficulty ratings align only weakly with actual token costs; and (5) models fail to self-predict token usage (correlations ≤0.39) and systematically underestimate costs.
Significance. If the quantitative patterns hold and the comparisons are robust, the work supplies valuable data on the economics of agentic systems, model token efficiency differences, and the limits of cost prediction. It could inform agent design, deployment decisions, and research on cost-aware agents. The observation of a gap between human-perceived complexity and actual computational effort is a useful contribution to understanding agent behavior.
major comments (3)
- [Methods] Methods section: The abstract and results report precise quantitative claims (1000x multiplier, 30x variability on same tasks, 1.5M token differences, correlations up to 0.39) but provide no details on number of runs per task, statistical tests, controls for prompt variation, or exact token measurement procedures (e.g., inclusion of system prompts, tool outputs, or retries). These omissions are load-bearing for verifying the central empirical findings.
- [Results] Results (token consumption comparisons): The claim that agentic tasks are 'uniquely expensive' with a 1000x multiplier over code reasoning and code chat is central but rests on unspecified baseline tasks, models, scaffolds, and measurement protocols for the non-agentic conditions. Without these, it is unclear whether the multiplier and input dominance generalize or are specific to the chosen comparators.
- [Discussion] Discussion: The observed patterns (input-driven costs, high variability, poor self-prediction) are derived exclusively from SWE-bench Verified trajectories with particular agent scaffolds. The manuscript should provide evidence or explicit caveats on why these would hold for other real-world agentic coding workflows that may differ in task openness, interaction length, or tool usage.
minor comments (3)
- [Abstract] Abstract: Model names should be standardized and match the exact identifiers used in experiments (e.g., 'Claude-Sonnet-4.5').
- [Figures] Figures: Token usage plots should display full distributions or variability metrics (e.g., boxplots, percentiles) rather than averages alone to illustrate the reported 30x stochastic differences.
- [Related Work] Related work: Include additional citations to prior empirical studies on LLM token costs or efficiency in non-agentic settings to better situate the novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity, reproducibility, and scope. We address each major comment point by point below and have revised the manuscript accordingly.
Point-by-point responses
-
Referee: [Methods] Methods section: The abstract and results report precise quantitative claims (1000x multiplier, 30x variability on same tasks, 1.5M token differences, correlations up to 0.39) but provide no details on number of runs per task, statistical tests, controls for prompt variation, or exact token measurement procedures (e.g., inclusion of system prompts, tool outputs, or retries). These omissions are load-bearing for verifying the central empirical findings.
Authors: We agree that the Methods section lacked sufficient detail to support the reported quantitative claims. In the revised manuscript we have substantially expanded this section to specify the number of runs per task, the statistical tests performed (including how variability and comparisons were assessed), controls for prompt variation, and the precise token measurement protocol (covering system prompts, tool outputs, and handling of retries). These additions directly address the reproducibility concerns. revision: yes
-
Referee: [Results] Results (token consumption comparisons): The claim that agentic tasks are 'uniquely expensive' with a 1000x multiplier over code reasoning and code chat is central but rests on unspecified baseline tasks, models, scaffolds, and measurement protocols for the non-agentic conditions. Without these, it is unclear whether the multiplier and input dominance generalize or are specific to the chosen comparators.
Authors: We acknowledge that the non-agentic baselines were insufficiently described. We have revised the Results and Methods sections to explicitly define the baseline tasks, the models and scaffolds used for them, and the consistent measurement protocols applied across agentic and non-agentic conditions. This clarification supports the 1000x multiplier and input-token dominance within the experimental setup we employed. revision: yes
-
Referee: [Discussion] Discussion: The observed patterns (input-driven costs, high variability, poor self-prediction) are derived exclusively from SWE-bench Verified trajectories with particular agent scaffolds. The manuscript should provide evidence or explicit caveats on why these would hold for other real-world agentic coding workflows that may differ in task openness, interaction length, or tool usage.
Authors: We agree that the Discussion should address generalizability. We have added explicit caveats in the revised Discussion section acknowledging that the reported patterns come from SWE-bench Verified with the specific agent scaffolds used. We discuss why core observations (such as input-driven costs arising from accumulating context and tool calls) are likely to be relevant more broadly, while noting that differences in task openness, interaction length, and tool usage in other workflows may affect the magnitude of these effects. We also outline directions for future validation on additional agentic coding settings. revision: yes
Circularity Check
No significant circularity in this empirical observational study
full rationale
The paper reports direct measurements of token consumption on SWE-bench Verified trajectories across eight models, along with observed variability, accuracy correlations, model efficiency differences, and LLMs' self-prediction performance (correlations up to 0.39). No equations, derivations, or first-principles results reduce any reported quantity to a fitted parameter or input defined by the same data. The 'prediction' component evaluates frontier models' ability to forecast their own token use before execution, assessed externally rather than through author-side fitting that would force the outcome. No self-citations are load-bearing for the central claims, and the analysis is self-contained with respect to the benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SWE-bench Verified tasks and the agent scaffolds used produce representative token-usage statistics for real-world agentic coding.
Forward citations
Cited by 1 Pith paper
-
MEMOA: Massive Mixtures of Online Agents via Mean-Field Decentralized Nash Equilibria
Derives unique closed-form decentralized policy minimizing worst-agent online regret that asymptotically converges to centralized Nash-optimal policy in mean-field limit, with added online mixture weighting.
discussion (0)