pith. sign in

arxiv: 2605.26165 · v1 · pith:OABJLWU2new · submitted 2026-05-24 · 💻 cs.SE · cs.AI· cs.CL

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

Pith reviewed 2026-06-29 23:21 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CL
keywords tool-schema compressionagentic RAGcontext budgetJSON schemasexact matchtool definitionsRAG systemsconstrained context
0
0 comments X

The pith

Tool-schema compression restores agentic RAG at 8K context budgets where uncompressed schemas overflow entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic RAG systems that attach many tool definitions run into a direct conflict because the schemas occupy the same context window required for retrieval. The paper tests this trade-off across 14 models and thousands of calls at 8K, 16K, and 32K budgets using 28 tool definitions. It shows that a conservative compression profile cuts schema tokens by 44-50 percent, turning a near-total failure at 8K tokens into usable performance with an average 20.5-point exact-match gain. At 32K tokens, where both formats fit, the gap shrinks to one point or less, confirming the gain is produced by the budget constraint rather than any change in model capability. The work therefore positions schema compression as a required step for running tool-using agents inside limited context windows.

Core claim

Applying TSCG conservative-profile compression to JSON tool schemas produces 44-50 percent token savings. At an 8K-token budget the uncompressed schemas overflow the window and yield only 2.6 percent average exact match, while the compressed versions restore functionality with a 20.5-point average lift across eight models. At 32K tokens the two formats produce statistically identical results, showing the benefit is strictly budget-driven. The same compression also raises the number of operable tools from roughly 494 to more than 800 before overflow occurs.

What carries the argument

TSCG conservative-profile compression, a method that reduces the token count of JSON tool schemas by 44-50 percent while keeping the information required for correct tool interpretation and invocation.

If this is right

  • At any context budget where uncompressed schemas overflow, the compressed versions restore RAG functionality.
  • When the budget is large enough for both formats, performance remains essentially unchanged.
  • The same compression extends the operable tool count from approximately 494 to beyond 800.
  • External checks on multi-hop questions confirm a 48-point exact-match gain under the overflow condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compression technique could be applied to any agent system that must keep large numbers of tool or API definitions inside a fixed context window.
  • Hardware platforms with smaller native context lengths become viable for tool-augmented agents once schemas are compressed.
  • Further gains might appear if compression is combined with retrieval methods that also prune irrelevant tool descriptions.

Load-bearing premise

The compression must preserve every detail the model needs to select and call the right tools without introducing new errors or capability loss.

What would settle it

A controlled run at 8K tokens in which models given the compressed schemas produce lower exact-match scores than models given the uncompressed schemas because they misinterpret or fail to invoke the tools.

Figures

Figures reproduced from arXiv: 2605.26165 by Furkan Sakizli.

Figure 1
Figure 1. Figure 1: Binary enablement at 8K context with 28 tools. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Frontier scaling with Sonnet 4 at 200K context. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that Tool-Schema Compression (TSCG) with conservative-profile compression achieves 44-50% token savings on JSON tool schemas, producing a binary enablement effect for agentic RAG: at 8K context, uncompressed schemas overflow and yield near-zero exact-match (2.6% average), while compressed schemas restore functionality with +20.5 pp average EM lift (and +24.7 pp on the six models showing full enablement). The effect is isolated as budget-driven by 32K controls (delta <=1 pp on four of five models), external HotpotQA validation (+48 pp), and scaling tests (JSON overflows at ~494 tools; compressed remains operational beyond 800). Results rest on 6,566 controlled calls across 14 models (1.5B-32B plus one frontier) with 28 tool definitions; all code, data, and checkpoints are public.

Significance. If the semantic-preservation claim holds, the work supplies the first systematic quantification of the tool-context trade-off and demonstrates a practical, low-overhead mitigation that is directly actionable for constrained deployments. Strengths include the large-scale controlled design, public artifacts, and the 32K control that falsifies information-loss explanations for the 8K lift.

minor comments (3)
  1. [Abstract] Abstract: the text states evaluation of 14 models yet reports the main enablement statistics on eight models; explicitly identify the eight-model subset and the selection criterion.
  2. [Abstract] Abstract: the phrase 'across all eight models' for the +20.5 pp lift should be accompanied by the per-model breakdown or at least the range to allow readers to assess uniformity.
  3. The manuscript should state the precise token counts used for the 8K/16K/32K budgets and confirm whether they are measured after compression or before.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements

full rationale

The paper reports controlled experiments measuring token savings from TSCG compression and exact-match lifts at fixed context budgets (8K/16K/32K) across 14 models and 6,566 calls. No derivations, equations, fitted parameters, or predictions appear; the enablement effect is isolated by direct comparison (overflow vs. fit) and 32K control showing near-zero delta. External validations (HotpotQA, scaling to 800+ tools) are independent tests. No self-citation chains, ansatzes, or self-definitional steps are present. This is a standard empirical result with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical study with no free parameters fitted to produce the central claim; relies on standard assumptions about model behavior with tool schemas.

axioms (1)
  • domain assumption Conservative compression of tool schemas preserves sufficient semantic information for correct model interpretation and tool use.
    Required for the observed performance restoration at 8K tokens.

pith-pipeline@v0.9.1-grok · 5793 in / 1264 out tokens · 30059 ms · 2026-06-29T23:21:42.978953+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 16 canonical work pages · 11 internal anchors

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Anthropic . 2024. Introducing the Model Context Protocol . https://www.anthropic.com/news/model-context-protocol. Open standard for connecting AI assistants to data sources. Accessed: 2026-04-28

  4. [4]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self- RAG : Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations (ICLR). ArXiv:2310.11511

  5. [5]

    Atlassian Labs . 2026. mcp-compressor: An MCP server wrapper for reducing tokens consumed by MCP tools . https://github.com/atlassian-labs/mcp-compressor. Accessed: 2026-04-28

  6. [6]

    Jacob Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Lawrence Erlbaum Associates, Hillsdale, NJ

  7. [7]

    Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2024. https://doi.org/10.18653/v1/2024.emnlp-demo.9 T iny A gent: Function calling at the edge . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Proces...

  8. [8]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. V2, comprehensive RAG survey

  9. [9]

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. In Journal of Machine Learning Research, volume 24, pages 1--43. ArXiv:2208.03299

  10. [10]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua : Compressing prompts for accelerated inference of large language models. In Proceedings of EMNLP 2023, pages 13358--13376. Association for Computational Linguistics. ArXiv:2310.05736

  11. [11]

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. LongLLMLingua : Accelerating and enhancing LLM s in long context scenarios via prompt compression. In Proceedings of ACL 2024, pages 1658--1677. ArXiv:2310.06839

  12. [12]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459--9474. ArXi...

  13. [13]

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. https://aclanthology.org/2023.emnlp-main.391/ Compressing context to enhance inference efficiency of large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342--6353. Association for Computational Linguistics

  14. [14]

    Lahoti, A., Li, K., Chen, B., Wang, C., Bick, A., Kolter, J

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the ACL, 12:157--173. ArXiv:2307.03172

  15. [15]

    arXiv preprint arXiv:2403.12968 , year=

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor R \"u hle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. 2024. LLMLingua-2 : Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of ACL 2024, pages 963--981. ArXiv:2403.12968

  16. [16]

    Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. 2025. The Berkeley Function Calling Leaderboard (BFCL) : From tool use to agentic evaluation of large language models. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learn...

  17. [17]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large language model connected with massive API s. In Advances in Neural Information Processing Systems, volume 37. ArXiv:2305.15334

  18. [18]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM : Facilitating large language models to master 16000+ real-world API s. In The Twelfth International Conference on...

  19. [19]

    Furkan Sakizli. 2026. https://doi.org/10.48550/arXiv.2605.04107 TSCG : Deterministic tool-schema compilation for agentic LLM deployments . arXiv preprint arXiv:2605.04107. Companion paper (Paper 1)

  20. [20]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36. ArXiv:2302.04761

  21. [21]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345. ArXiv:2308.11432

  22. [22]

    Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80--83

  23. [23]

    Cohen, Ruslan Salakhutdinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 H otpot QA : A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380

  24. [24]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct : Synergizing reasoning and acting in language models. The Eleventh International Conference on Learning Representations (ICLR). ArXiv:2210.03629