TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
Pith reviewed 2026-05-08 17:35 UTC · model grok-4.3
The pith
Converting JSON tool schemas to structured text with eight deterministic operators restores high tool-use accuracy for small LLMs on large catalogs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TSCG is a deterministic compiler that turns JSON tool schemas into structured text using eight composable operators and a formal compression bound of at least 51 percent. This change at the schema level resolves the dominant failure mode for small models in agentic tool use, as confirmed by per-operator isolation and format-versus-compression tests on thousands of calls across 12 models and 5 scenarios. The resulting accuracy and token-efficiency advantages hold on both synthetic and production MCP schemas.
What carries the argument
The eight composable operators that rewrite JSON schemas into structured text, supplying both the accuracy lift and the compression bound.
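As a concrete illustration of what a schema-level rewrite of this kind looks like, here is a minimal sketch in TypeScript. The operator name, field names, and output format below are invented for illustration; the paper's actual eight operators are not reproduced in this summary.

```typescript
// Hypothetical sketch: one deterministic rewrite step that renders a JSON
// tool schema as compact structured text. Illustrative only -- not TSCG's
// actual operator set or output grammar.

interface ToolSchema {
  name: string;
  description: string;
  parameters: Record<
    string,
    { type: string; description?: string; required?: boolean }
  >;
}

// A TSCG-style rewrite is a pure, deterministic function over the schema:
// same JSON in, same structured text out, with no model access or search.
function renderStructuredText(tool: ToolSchema): string {
  const lines = [`tool ${tool.name}: ${tool.description}`];
  for (const [param, spec] of Object.entries(tool.parameters)) {
    const req = spec.required ? "required" : "optional";
    const desc = spec.description ? ` - ${spec.description}` : "";
    lines.push(`  ${param} (${spec.type}, ${req})${desc}`);
  }
  return lines.join("\n");
}

const weather: ToolSchema = {
  name: "get_weather",
  description: "Look up current weather",
  parameters: {
    city: { type: "string", description: "City name", required: true },
    units: { type: "string", description: "metric or imperial" },
  },
};

console.log(renderStructuredText(weather));
```

The point of the sketch is the shape of the transformation, not its details: the JSON scaffolding (braces, quotes, key repetition) is replaced by an indented listing that spends tokens only on names, types, and descriptions.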
If this is right
- Small models achieve usable accuracy on production-scale tool sets of 20-50 tools.
- Token savings of 52-57 percent remain consistent even on heavy real-world schemas.
- Different models show distinct sensitivities to individual operators, enabling per-model compilation choices.
- Accuracy results on synthetic catalogs generalize to actual MCP schemas within 0.1 points.
- The method requires no per-model adjustments or runtime search.
Where Pith is reading between the lines
- Interface-level compilation could reduce reliance on larger models for agent workloads.
- The operator profiles suggest building lightweight schema adapters tailored to model families.
- Similar deterministic rewrites might improve LLM performance on other structured inputs such as API responses or configuration files.
- Longer-term deployment pipelines could treat schema format as a tunable parameter separate from model weights.
Load-bearing premise
The structured text created by the operators will be interpreted more reliably by LLMs than the original JSON across arbitrary models and schemas without introducing new parsing errors.
What would settle it
Testing TSCG output against the original JSON on an untested model or schema and finding no accuracy gain or an outright drop would falsify the claim that format change is the primary reliable mechanism.
original abstract
Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B-14B), this protocol mismatch accounts for the majority of tool-use failure at production catalog sizes. We present TSCG, a deterministic tool-schema compiler that resolves this mismatch at the API boundary, converting JSON schemas into token-efficient structured text without model access, fine-tuning, or runtime search. TSCG combines eight composable operators with a formal compression bound (>=51% on well-formed schemas). On TSCG-Agentic-Bench (about 19,000 calls, 12 models, 5 scenarios), TSCG restores Phi-4 14B from 0% to 84.4% accuracy at 20 tools (90.3% at 50 tools) and achieves 108-181% accuracy-retained ratio across three models on BFCL. Format-versus-compression decomposition (R^2=0.88 -> 0.03) establishes representation change as the dominant mechanism. Per-operator isolation across three frontier models reveals three distinct operator-response profiles: operator-hungry (Opus 4.7), operator-sensitive (GPT-5.2), and operator-robust (Sonnet 4), providing per-model deployment guidance. Scaling experiments show accuracy advantages persisting on heavy production MCP schemas (+5.0 pp at about 10,500 input tokens) despite saturation on light synthetic catalogs, with 52-57% token savings throughout. The synthetic benchmark generalizes to real MCP schemas within 0.1 accuracy points. TSCG ships as a 1,200-line zero-dependency TypeScript package.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TSCG, a deterministic zero-dependency compiler that converts JSON tool schemas into structured text via eight composable operators, together with a formal compression bound of at least 51%. On TSCG-Agentic-Bench (~19k calls, 12 models) it reports restoring Phi-4 14B accuracy from 0% to 84.4% at 20 tools (90.3% at 50 tools), 108-181% accuracy-retained ratios on BFCL, and persistence of gains on heavy real MCP schemas; a format-versus-compression regression (R² 0.88 → 0.03) isolates representation change as the dominant mechanism, while per-operator profiles supply model-specific deployment guidance.
Significance. If the empirical results hold, the work supplies a practical, training-free intervention at the API boundary that directly addresses a documented failure mode for small-to-medium models in production agent frameworks. Notable strengths include the formal compression bound, the multi-model (12-model) validation, the explicit generalization test to real MCP schemas within 0.1 accuracy points, and the open 1,200-line TypeScript package that enables immediate reproducibility.
major comments (2)
- §4.2 (Benchmark Construction): the central accuracy claims (e.g., Phi-4 14B restoration to 84.4% at 20 tools) rest on ~19k calls across five scenarios; the manuscript must supply the precise accuracy metric definition, the full set of baseline controls (including any prompt or decoding variations), and statistical significance tests to confirm that the lifts are not artifacts of benchmark construction.
- §5.1 (Format-versus-Compression Regression): the reported R² drop from 0.88 to 0.03 is used to establish representation change as dominant; confirm that the regression controls for token-length correlation and that the 52-57% token savings do not themselves explain the accuracy delta after the format variable is included.
minor comments (3)
- Table 3 (Per-operator profiles): label the three response profiles (operator-hungry, operator-sensitive, operator-robust) with explicit model names and include error bars or confidence intervals for the accuracy deltas.
- §3 (Operator Definitions): provide compact pseudocode or a formal grammar for the eight operators and their composition rules so that the deterministic guarantee can be verified without reading the full TypeScript source.
- Figure 4 (Scaling on MCP schemas): clarify the input-token range (~10,500 tokens) and state whether the +5.0 pp gain is measured against the same JSON baseline or an already-optimized prompt.
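The second minor comment is easy to make concrete: if each operator is a pure text-to-text function, the deterministic guarantee follows from function purity under composition. A minimal sketch of that framing, with invented stand-in operators (the names and regexes below are not the paper's):

```typescript
// Hedged sketch: stating a deterministic guarantee for composed operators,
// independent of TSCG's actual operator definitions.
// Each operator is a pure function on schema text; composition is a fold.

type Operator = (schemaText: string) => string;

// Illustrative stand-ins for two of the eight operators (not from the paper).
const stripNulls: Operator = (s) => s.replace(/"[^"]+":\s*null,?/g, "");
const collapseWhitespace: Operator = (s) => s.replace(/\s+/g, " ").trim();

// Compiling is left-to-right application of the operator pipeline.
function compile(schemaText: string, ops: Operator[]): string {
  return ops.reduce((text, op) => op(text), schemaText);
}

// Determinism check: identical input must always yield identical output,
// because no operator reads state outside its argument.
const input = JSON.stringify({ name: "search", filter: null });
const once = compile(input, [stripNulls, collapseWhitespace]);
const twice = compile(input, [stripNulls, collapseWhitespace]);
console.log(once === twice);
```

Under this framing, verifying the guarantee reduces to checking that each individual operator is pure and total on well-formed input, which is exactly what compact pseudocode or a grammar would let a reader do without reading the full source.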
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The two major comments identify areas where additional transparency will strengthen the paper. We address each point below and will incorporate the requested details and clarifications in the revised manuscript.
read point-by-point responses
Referee: §4.2 (Benchmark Construction): the central accuracy claims (e.g., Phi-4 14B restoration to 84.4% at 20 tools) rest on ~19k calls across five scenarios; the manuscript must supply the precise accuracy metric definition, the full set of baseline controls (including any prompt or decoding variations), and statistical significance tests to confirm that the lifts are not artifacts of benchmark construction.
Authors: We agree that explicit documentation of the accuracy metric, controls, and significance testing is required for full reproducibility and to rule out benchmark artifacts. Accuracy is defined as the exact-match rate on tool name, parameter values, and output format against ground-truth invocations. All baselines used identical prompt templates (provided in the appendix) and greedy decoding (temperature = 0) with no sampling variations. In the revision we will expand §4.2 with (i) the formal metric definition, (ii) the complete prompt and decoding specifications, and (iii) McNemar's test results for paired accuracy differences (all reported lifts p < 0.01). These additions directly address the concern. Revision: yes.
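The metric as stated is mechanical enough to sketch. The following is our reading of the definition above, not the benchmark's actual scorer; the type and function names are invented:

```typescript
// Illustrative sketch of an exact-match tool-call metric: a prediction
// counts as correct only if the tool name and every parameter value match
// the ground truth exactly (argument order is irrelevant).

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function exactMatch(pred: ToolCall, gold: ToolCall): boolean {
  if (pred.name !== gold.name) return false;
  const predKeys = Object.keys(pred.args).sort();
  const goldKeys = Object.keys(gold.args).sort();
  if (predKeys.length !== goldKeys.length) return false;
  // Compare values structurally so nested objects and arrays match by content.
  return predKeys.every(
    (k, i) =>
      k === goldKeys[i] &&
      JSON.stringify(pred.args[k]) === JSON.stringify(gold.args[k])
  );
}

// Accuracy over a paired set of predictions and ground-truth invocations.
function accuracy(preds: ToolCall[], golds: ToolCall[]): number {
  const hits = preds.filter((p, i) => exactMatch(p, golds[i])).length;
  return hits / golds.length;
}
```

A scorer of this shape is what a McNemar's test would then be run over: each call yields a paired correct/incorrect outcome under the JSON and TSCG conditions.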
Referee: §5.1 (Format-versus-Compression Regression): the reported R² drop from 0.88 to 0.03 is used to establish representation change as dominant; confirm that the regression controls for token-length correlation and that the 52-57% token savings do not themselves explain the accuracy delta after the format variable is included.
Authors: The regression is hierarchical: token length is entered first as a covariate, after which the binary format indicator (TSCG vs. JSON) is added. The reported R² drop therefore already reflects the incremental explanatory power of format after length is controlled. The 52-57% token reduction is a direct consequence of the TSCG representation rather than an independent variable. In the revision we will add the full coefficient table, standard errors, and variance-inflation-factor diagnostics (all VIF < 2) to make the control explicit and to confirm that format accounts for the accuracy delta independently of length. Revision: yes.
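The hierarchical design described in this response can be sanity-checked on toy data. The sketch below is our reconstruction, not the authors' analysis code; the six data points are fabricated so that the format flag, not token length, carries the signal:

```typescript
// Sketch of a hierarchical regression: fit accuracy on token length alone,
// then on length plus a binary format indicator, and compare R². Data are
// synthetic and illustrative only.

function olsR2(X: number[][], y: number[]): number {
  // Augment with an intercept column and solve the normal equations
  // (X'X)b = X'y by Gauss-Jordan elimination with partial pivoting.
  const A = X.map((row) => [1, ...row]);
  const p = A[0].length;
  const XtX = Array.from({ length: p }, (_, i) =>
    Array.from({ length: p }, (_, j) =>
      A.reduce((s, row) => s + row[i] * row[j], 0)
    )
  );
  const Xty = Array.from({ length: p }, (_, i) =>
    A.reduce((s, row, r) => s + row[i] * y[r], 0)
  );
  for (let c = 0; c < p; c++) {
    let piv = c;
    for (let r = c + 1; r < p; r++)
      if (Math.abs(XtX[r][c]) > Math.abs(XtX[piv][c])) piv = r;
    [XtX[c], XtX[piv]] = [XtX[piv], XtX[c]];
    [Xty[c], Xty[piv]] = [Xty[piv], Xty[c]];
    for (let r = 0; r < p; r++) {
      if (r === c) continue;
      const f = XtX[r][c] / XtX[c][c];
      for (let j = 0; j < p; j++) XtX[r][j] -= f * XtX[c][j];
      Xty[r] -= f * Xty[c];
    }
  }
  const b = Xty.map((v, i) => v / XtX[i][i]);
  // R² = 1 - SS_res / SS_tot over the fitted model.
  const mean = y.reduce((s, v) => s + v, 0) / y.length;
  let ssRes = 0;
  let ssTot = 0;
  y.forEach((yi, r) => {
    const pred = A[r].reduce((s, x, j) => s + x * b[j], 0);
    ssRes += (yi - pred) ** 2;
    ssTot += (yi - mean) ** 2;
  });
  return 1 - ssRes / ssTot;
}

// Toy data: length is roughly balanced across conditions, while accuracy
// tracks the format flag (1 = structured text, 0 = JSON).
const len = [900, 950, 1100, 1050, 1000, 940];
const fmt = [0, 1, 0, 1, 0, 1];
const acc = [0.31, 0.82, 0.28, 0.85, 0.33, 0.80];

const r2Length = olsR2(len.map((l) => [l]), acc);
const r2Both = olsR2(len.map((l, i) => [l, fmt[i]]), acc);
console.log(r2Length.toFixed(2), r2Both.toFixed(2));
```

On data constructed this way, length alone explains almost nothing while the length-plus-format model fits nearly perfectly, which is the qualitative pattern the rebuttal claims for the real regression.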
Circularity Check
No significant circularity
full rationale
The paper's central claims rest on empirical benchmark results from TSCG-Agentic-Bench (approximately 19,000 calls across 12 models) and BFCL, demonstrating accuracy restoration via deterministic JSON-to-structured-text conversion using eight operators. No mathematical derivation chain, fitted parameters renamed as predictions, or self-referential definitions appear; the format-versus-compression regression (R² drop from 0.88 to 0.03) and per-operator profiles are direct measurements on held-out data, independent of the compiler's definition. The formal compression bound and zero-dependency implementation provide separate support without reducing claims to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: input tool schemas are valid, well-formed JSON that can be parsed deterministically.
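This axiom can be made operational at the compiler boundary: reject any schema that does not parse, so every downstream operator sees valid JSON. The guard below is illustrative; `parseSchemaOrThrow` is not the package's actual API.

```typescript
// Minimal precondition guard for the well-formedness axiom (illustrative).
// JSON.parse is deterministic per the JSON grammar: the same string either
// always yields the same value or always throws.

function parseSchemaOrThrow(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch (e) {
    throw new Error(
      `Precondition violated: schema is not well-formed JSON (${(e as Error).message})`
    );
  }
}
```

With a guard like this, the compression bound's "well-formed schemas" qualifier becomes an enforced invariant rather than an unchecked assumption.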
Reference graph
Works this paper leans on
- [1] Anthropic. 2024. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol. Open standard for connecting AI assistants to data sources. Accessed 2026-04-28.
- [2] Anthropic. 2025. Code execution with MCP: Building more efficient AI agents. https://www.anthropic.com/engineering/code-execution-with-mcp. Engineering blog post. Accessed 2026-04-28.
- [3] Md Adnan Arefeen, Biplob Debnath, and Srimat Chakradhar. 2024. LeanContext: Cost-efficient domain-specific question answering using LLMs. Natural Language Processing Journal, 7:100065. https://doi.org/10.1016/j.nlp.2024.100065. ArXiv:2309.00841.
- [4] Atlassian Labs. 2026. mcp-compressor: An MCP server wrapper for reducing tokens consumed by MCP tools. https://github.com/atlassian-labs/mcp-compressor. Open-source MCP proxy. Accessed 2026-04-28.
- [5] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829-3846. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.232. ArXiv:2305.14788.
- [6] Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Mihir Sonawane, Navya Singh, Avani Appalla, Will Flanagan, and Filipe Condessa. 2025. CompactPrompt: A unified pipeline for prompt and data compression in LLM workflows. In 2nd Workshop on LLMs and Generative AI for Finance, ACM ICAIF '25, Singapore. https://arxiv.org/abs/2510.18043.
- [7] Cloudflare. 2026. Code Mode: Bringing programmatic tool access to AI agents. https://blog.cloudflare.com/code-mode-mcp/. Engineering blog post. Accessed 2026-04-28.
- [8]
- [9] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. EvoPrompt: Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=ZG3RaNIsO8. ArXiv:2309.08532.
- [10] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358-13376. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.825.
- [11] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://aclanthology.org/2024.acl-long.91/.
- [12] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714. https://arxiv.org/abs/2310.03714.
- [13] Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342-6353. Association for Computational Linguistics. ArXiv:2310.06201.
- [14] Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. 2025. Prompt compression for large language models: A survey. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). https://doi.org/10.18653/v1/2025.naacl-long.368.
- [15] Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. In Advances in Neural Information Processing Systems, volume 36. https://papers.nips.cc/paper_files/paper/2023/hash/3d77c6dcc7f143aa2154e7f4d5e22d68-Abstract-Conference.html. ArXiv:2304.08467.
- [16] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. 2024. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024. https://doi.org/10.18653/v1/2024.findings-acl.57.
- [17] Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. 2025. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Proceedings of the 42nd International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v267/patil25a.html.
- [18] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems, volume 37. https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html. ArXiv:2305.15334.
- [19] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957-7968. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.494.
- [20] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations (ICLR).
- [21] Tobias Schnabel and Jennifer Neville. 2024. Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 670-686, Miami, Florida. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.37/. SAMMO framework.
- [22] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162.
- [23] toon-format. 2025. TOON: Token-Oriented Object Notation Specification. https://github.com/toon-format/toon. Accessed 2026-04-28.
- [24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998-6008. https://papers.nips.cc/paper/7181-attention-is-all-you-need. ArXiv:1706.03762.
- [25] Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan Ö. Arık, and Tomas Pfister. 2023. Universal self-adaptive prompting. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7437-7462. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.461.
- [26] Ming Wang, Yuanzhong Liu, Xiaoyu Liang, Songlian Li, Yijie Huang, Xiaoming Zhang, Sijia Shen, Chaofeng Guan, Daling Wang, Shi Feng, Huaiwen Zhang, Yifei Zhang, Minghui Zheng, and Chi Zhang. 2024a. LangGPT: Rethinking structured reusable prompt design framework for large language models from the programming language perspective. https://arxiv.org/abs/2402.16929.
- [27] Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. 2024b. PromptAgent: Strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=22pyNMuIoa. ArXiv:2310.16427.
- [28] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=NG7sS51zVF. ArXiv:2309.17453.
- [29] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024. Large language models as optimizers. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Bb4VGOWELI. ArXiv:2309.03409.
- [30] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. 2025. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609-616. https://www.nature.com/articles/s41586-025-08661-4. Preprint: arXiv:2406.07496 ("TextGrad: Automatic Differentiation via Text").
- [31] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=92gvk82DE-. ArXiv:2211.01910.