Pith · machine review for the scientific record

arXiv: 2604.12301 · v1 · submitted 2026-04-14 · 💻 cs.DC · cs.AI · cs.SE

Recognition: unknown

Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

Elliot Amponsah, Godfred Manu Addo Boakye, Jerry John Kponyo, Justice Owusu Agyemang, Kwame Opuni-Boachie Obour Agyekum


Pith reviewed 2026-05-10 15:26 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.SE
keywords local routing · prompt compression · cloud LLM · token reduction · coding agents · hybrid inference · measurement study

The pith

Local routing combined with prompt compression cuts cloud LLM token use by 45-79% on coding-agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests seven tactics that let a small local model handle some of the work before calling a larger cloud model for coding-agent jobs. It finds that sending only the right queries to the cloud and shrinking the prompts first delivers the biggest token reductions on edit and explanation tasks. For retrieval-heavy work, adding local drafting with cloud review lifts savings to 51 percent. The best combination changes with the workload type, so the study measures each tactic alone and in groups across four classes of tasks while tracking cost, speed, and accuracy. An open-source layer makes it possible to plug in any local model via Ollama and any OpenAI-compatible cloud endpoint.

Core claim

Local routing (T1) plus prompt compression (T2) produces 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while the full set of tactics including local drafting with cloud review (T4) reaches 51% savings on RAG-heavy workloads.

What carries the argument

A local model used as a triage layer that performs routing, compression, drafting, or intent extraction before the cloud model is called.
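The triage pattern can be sketched in a few lines. This is an illustrative reconstruction, not the paper's shim: the difficulty scorer, the 0.5 threshold, and the whitespace-only compressor are all placeholder assumptions (the paper's T2 would use a real prompt compressor, and its model calls would go through Ollama and a cloud endpoint).

```python
import re

def compress_prompt(prompt: str) -> str:
    """Stand-in for T2: drop blank lines and collapse whitespace runs.
    A real compressor (LLMLingua-style) would also prune
    low-information tokens while preserving intent."""
    lines = (re.sub(r"\s+", " ", ln).strip() for ln in prompt.splitlines())
    return "\n".join(ln for ln in lines if ln)

def triage(prompt, score_difficulty, answer_locally, answer_in_cloud,
           threshold=0.5):
    """Stand-in for T1: answer easy queries with the local model and
    escalate hard ones to the cloud, compressing the prompt first so
    the escalated call spends fewer cloud tokens."""
    if score_difficulty(prompt) < threshold:
        return "local", answer_locally(prompt)
    return "cloud", answer_in_cloud(compress_prompt(prompt))
```

Injecting the scorer and the two model backends as plain functions keeps the routing logic itself testable without any model running.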

If this is right

  • Token savings vary sharply by workload class, so practitioners should select tactic subsets based on whether tasks are edit-heavy, explanation-heavy, or RAG-heavy.
  • The open-source shim supports any Ollama local model and any OpenAI-compatible cloud endpoint, allowing direct measurement of cost and latency trade-offs.
  • Greedy-additive combination reveals that, on certain workloads, later tactics add value only after the first two are applied.
  • Dollar cost and latency both drop when cloud tokens fall, though the exact numbers depend on the chosen local and cloud models.
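The greedy-additive procedure referred to above can be sketched as follows. The savings table is invented for illustration only (it is not the paper's data); it merely encodes the qualitative pattern that a later tactic such as T4 pays off only once T1 and T2 are in place.

```python
def greedy_additive_subset(tactics, measure_savings):
    """Greedily grow a tactic subset: at each step add the tactic with
    the largest marginal token savings, stopping when no remaining
    tactic improves on the current subset."""
    chosen, remaining = [], list(tactics)
    best = measure_savings(chosen)
    while remaining:
        top_savings, top_tactic = max(
            (measure_savings(chosen + [t]), t) for t in remaining)
        if top_savings <= best:
            break
        chosen.append(top_tactic)
        remaining.remove(top_tactic)
        best = top_savings
    return chosen, best

# Hypothetical measured savings (%) per subset, for illustration only.
SAVINGS = {
    (): 0.0,
    ("T1",): 30.0, ("T2",): 20.0, ("T4",): 5.0,
    ("T1", "T2"): 55.0, ("T1", "T4"): 32.0, ("T2", "T4"): 22.0,
    ("T1", "T2", "T4"): 58.0,
}

def lookup(subset):
    return SAVINGS.get(tuple(sorted(subset)), 0.0)

subset, savings = greedy_additive_subset(["T1", "T2", "T4"], lookup)
```

On this toy table the greedy order is T1, then T2, then T4, matching the pattern the study reports: the first two tactics carry most of the savings and the draft-review tactic adds a smaller increment on top.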

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams building coding agents may need separate configurations for different user patterns rather than a single default setup.
  • The same triage approach could be tested on non-coding tasks such as data analysis or document editing to see if savings generalize.
  • Smaller local models might still deliver most of the routing and compression benefit, which would lower the hardware bar for deployment.

Load-bearing premise

The local model's choices about which queries to route, how to compress prompts, and what to draft still leave the cloud model with enough information to return correct final answers.

What would settle it

A run on the same workloads where the hybrid outputs contain clear errors in code edits or explanations that a pure cloud model would have avoided.

Figures

Figures reproduced from arXiv: 2604.12301 by Elliot Amponsah, Godfred Manu Addo Boakye, Jerry John Kponyo, Justice Owusu Agyemang, Kwame Opuni-Boachie Obour Agyekum.

Figure 1. Pipeline stage ordering. Each stage either answers the request, transforms it, or passes it …
Figure 2. Cloud token savings (%) per tactic in isolation. Error bars show the half-range across two …
Figure 3. Cloud token savings (%) for tactic combinations. Error bars show the half-range across …
Figure 4. Token savings vs. dollar cost per 10-sample workload. Points closer to the lower-right are …
Original abstract

We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local routing) combined with T2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while on RAG-heavy workloads the full tactic set including T4 (draft-review) achieves 51% savings. We observe that the optimal tactic subset is workload-dependent, which we believe is the most actionable finding for practitioners deploying coding agents today.
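The semantic-caching tactic (T3) can be illustrated with a minimal sketch. The bag-of-words embedding and the 0.9 similarity threshold below are placeholder assumptions; the paper's shim would use a real sentence encoder such as Sentence-BERT.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector. A real
    implementation would call a sentence encoder instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """T3 sketch: reuse a cached cloud answer when a new query lands
    close enough in embedding space to a previously answered one,
    skipping the cloud call entirely on a hit."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query):
        q = embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))
```

A hit avoids both the cloud tokens and the latency of a round trip, which is why T3 interacts with workload class: repetitive edit-heavy sessions hit the cache far more often than open-ended chat.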

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents a measurement study of seven tactics for reducing cloud LLM token usage in coding-agent workloads using a local model as a triage layer. The tactics include local routing (T1), prompt compression (T2), semantic caching (T3), local drafting with cloud review (T4), minimal-diff edits (T5), structured intent extraction (T6), and batching with vendor prompt caching (T7). Implemented in an open-source shim supporting Ollama and OpenAI-compatible endpoints, the study evaluates individual tactics, pairs, and greedy-additive subsets on four workload classes: edit-heavy, explanation-heavy, general chat, and RAG-heavy. Key measurements include tokens saved, dollar cost, latency, and routing accuracy. The main findings are that T1 combined with T2 achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while the full tactic set including T4 achieves 51% savings on RAG-heavy workloads, with the optimal subset being workload-dependent.

Significance. If the results hold, this study is significant for providing concrete, workload-specific guidance on cost-saving tactics for deploying coding agents with hybrid local-cloud LLM setups. The open-source implementation of the shim is a strength, enabling reproducibility and extension by practitioners. The emphasis on workload dependence offers actionable insights beyond generic claims.

major comments (1)
  1. [Abstract] The headline findings report specific percentage savings (45-79% for T1+T2 and 51% for full set) but the measurements described focus on tokens, cost, latency, and routing accuracy without mention of end-to-end output quality metrics such as task success rates, unit test pass rates, or fidelity scores against a cloud-only baseline. This is load-bearing because the central claim of useful token reduction assumes that local triage, compression, and drafting decisions do not degrade final output correctness.
minor comments (3)
  1. The abstract and evaluation lack details on sample sizes, number of workload instances per class, statistical tests, error bars, or confidence intervals for the reported savings percentages.
  2. It is unclear how routing accuracy was measured and whether it correlates with overall task success.
  3. The selection of tactic subsets appears post-hoc; clarification on whether this was pre-specified or exploratory would strengthen the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for your thorough review and constructive feedback on our manuscript. We appreciate the emphasis on output quality metrics and address the major comment below. We will revise the manuscript to better contextualize our measurements and acknowledge limitations.

Point-by-point responses
  1. Referee: [Abstract] The headline findings report specific percentage savings (45-79% for T1+T2 and 51% for full set) but the measurements described focus on tokens, cost, latency, and routing accuracy without mention of end-to-end output quality metrics such as task success rates, unit test pass rates, or fidelity scores against a cloud-only baseline. This is load-bearing because the central claim of useful token reduction assumes that local triage, compression, and drafting decisions do not degrade final output correctness.

    Authors: We agree that the lack of explicit end-to-end output quality metrics represents a limitation when claiming the practical value of token reductions. This study is a focused measurement of token, cost, latency, and routing accuracy for the seven tactics, with the headline results derived directly from those quantities. The tactics were designed to avoid quality loss where possible (local routing escalates difficult cases to the cloud model; compression targets redundant tokens while retaining intent; draft-review uses the cloud for final output). However, we did not collect task success rates, unit test pass rates, or direct fidelity comparisons to a cloud-only baseline. We will revise the abstract to explicitly note the scope of our measurements, add a limitations subsection discussing the quality-preservation assumption and risks of degradation, and include future-work directions for incorporating quality metrics on coding workloads. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement study with direct token counts

full rationale

This is a measurement study that implements seven tactics in an open-source shim, runs them on four workload classes, and reports observed token savings, costs, latency, and routing accuracy from direct instrumentation. No derivations, equations, fitted parameters, or predictions appear in the abstract or described methodology. The headline claims (45-79% savings from T1+T2, 51% from full set on RAG) are explicit empirical aggregates of measured token counts rather than quantities derived from any self-referential definition, ansatz, or self-citation chain. The central results therefore remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard assumptions about representative workloads and local-model triage quality; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The four workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy) are representative of real coding-agent usage.
    Headline findings are conditioned on these classes.

pith-pipeline@v0.9.0 · 5578 in / 1242 out tokens · 33174 ms · 2026-05-10T15:26:18.611499+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

  2. [2]

    Prompt Caching with Claude

    Anthropic. Prompt caching with Claude. Anthropic Documentation, 2024. URL https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching.

  3. [3]

    Model Context Protocol Specification

    Anthropic. Model context protocol specification, 2024. URL https://modelcontextprotocol.io.

  4. [4]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 3119–3137. Association for Computational Linguistics, 2024.

  5. [5]

    GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings

    Fu Bang. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 212–218. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.nlposs-1.24. URL https://aclanthology.org/2023.nlposs-1.24/.

  6. [6]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 5209–5235, 2024.

  7. [7]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research. URL https://openreview.net/forum?id=cSimKw5p6R. arXiv:2305.05176.

  9. [9]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, et al. Gemma: Open models based on Gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://arxiv.org/abs/2407.21783

  11. [11]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (Dec 2023)

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Com- pressing prompts for accelerated inference of large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.825. URL htt...

  12. [12]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.

  13. [13]

    Fast Inference from Transformers via Speculative Decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19274–19286. PMLR, 2023. URL https://proceedings.mlr.press/v202/leviathan23a.html.

  14. [14]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.

  15. [15]

    Large Language Model-Based Agents for Software Engineering: A Survey

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977, 2024.

  16. [16]

    Small Language Models: Survey, Measurements, and Insights

    Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024. URL https://arxiv.org/abs/2409.15790.

  17. [17]

    Ollama: Get Up and Running with Large Language Models Locally

    Ollama. Ollama: Get up and running with large language models locally, 2024. URL https://ollama.com.

  18. [18]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665, 2024. URL https://arxiv.org/abs/2406.18665.

  19. [19]

    Prompt Caching in the API

    OpenAI. Prompt caching in the API. OpenAI Developer Documentation, 2024. URL https://openai.com/index/api-prompt-caching/.

  20. [20]

    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024.

  21. [21]

    Qwen2.5 Technical Report

    Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  22. [22]

    Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1410.

  23. [23]

    Efficient Guided Generation for Large Language Models

    Brandon T. Willard and Rémi Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023

  24. [24]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.