pith. machine review for the scientific record.

arxiv: 2406.07496 · v1 · submitted 2024-06-11 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

TextGrad: Automatic "Differentiation" via Text

Pith reviewed 2026-05-13 11:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords TextGrad · automatic differentiation · compound AI systems · textual feedback · LLM optimization · backpropagation · PyTorch syntax · multi-component AI

The pith

TextGrad backpropagates LLM textual feedback to optimize individual components in compound AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TextGrad as a framework for automatic differentiation through text. It allows LLMs to supply natural language feedback that updates variables across a computation graph of AI components. This mirrors the role of backpropagation in making neural network training automated and scalable. Readers would care if the approach generalizes because it could convert ad-hoc system building into a more systematic optimization process for multi-model applications.

Core claim

TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system, following PyTorch syntax and abstraction, and works out-of-the-box across tasks from question answering and coding to molecule design and radiotherapy planning.

What carries the argument

The TextGrad framework, which uses LLMs to generate natural language suggestions that serve as gradients for optimizing variables in a computation graph.
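The mechanism can be illustrated with a toy loop. The sketch below is not the TextGrad library's API: the `Variable` class, the `forward`, `backward`, and `step` functions, and the three stub "models" are hypothetical names invented for illustration, and the stubs return canned strings so the example runs offline. It shows only the shape of the idea: a forward pass records the graph, a critic produces textual feedback, that feedback is passed to upstream variables, and an editor rewrites them.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call; a real system would query a model here.
LLM = Callable[[str], str]


@dataclass
class Variable:
    """A graph node that holds text instead of a tensor."""
    value: str
    role: str
    requires_grad: bool = False
    predecessors: List["Variable"] = field(default_factory=list)
    gradients: List[str] = field(default_factory=list)  # textual feedback


def forward(prompt: Variable, question: str, llm: LLM) -> Variable:
    """Forward pass: produce an answer node that remembers its parent."""
    text = llm(f"{prompt.value}\n\nQuestion: {question}")
    return Variable(text, role="answer", predecessors=[prompt])


def backward(output: Variable, objective: str, critic: LLM) -> None:
    """Backward pass: critique the output, then relay feedback to its parents."""
    output.gradients.append(
        critic(f"Objective: {objective}\nCritique this {output.role}: {output.value}")
    )
    for parent in output.predecessors:
        if parent.requires_grad:
            parent.gradients.append(
                critic(
                    f"The {output.role} received this feedback: {output.gradients[-1]}\n"
                    f"How should the {parent.role} change? Current: {parent.value}"
                )
            )


def step(var: Variable, editor: LLM) -> None:
    """Update step: rewrite the variable using its accumulated feedback."""
    feedback = "\n".join(var.gradients)
    var.value = editor(f"Rewrite: {var.value}\nUsing feedback: {feedback}")
    var.gradients.clear()


# Deterministic stubs (assumptions for illustration, not real model behavior).
def stub_model(text: str) -> str:
    return "42" if "step by step" in text else "I am not sure."


def stub_critic(text: str) -> str:
    return "Ask for explicit reasoning before the final answer."


def stub_editor(text: str) -> str:
    return "Answer the question. Think step by step."


prompt = Variable("Answer the question.", role="system prompt", requires_grad=True)
answer = forward(prompt, "What is 6 * 7?", stub_model)
before = answer.value
backward(answer, "The answer should be correct and justified.", stub_critic)
step(prompt, stub_editor)
after = forward(prompt, "What is 6 * 7?", stub_model).value
print(before, "->", after)
```

The forward/backward/step split mirrors the PyTorch-style abstraction the paper describes; whether real LLM feedback behaves as well as these stubs is exactly the premise questioned further down the page.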

If this is right

  • Zero-shot accuracy of GPT-4o on Google-Proof Question Answering rises from 51% to 55%.
  • 20% relative performance gain on LeetCode-Hard coding problem solutions.
  • New druglike small molecules are designed with desirable in silico binding.
  • Radiation oncology treatment plans are produced with high specificity.
  • Reasoning prompts improve without any framework modifications.
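The first two figures use different scales: the GPQA result is stated as absolute accuracies, while the LeetCode-Hard result is stated as a relative gain. A short calculation, using only the GPQA numbers quoted above and a helper name (`relative_gain`) chosen here for illustration, puts the GPQA change on both scales.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# GPQA zero-shot accuracies as quoted in the abstract.
gpqa_before, gpqa_after = 0.51, 0.55

absolute_points = (gpqa_after - gpqa_before) * 100
relative = relative_gain(gpqa_before, gpqa_after)

print(f"GPQA: +{absolute_points:.1f} points absolute, {relative:.1%} relative")
# prints "GPQA: +4.0 points absolute, 7.8% relative"
```

The LeetCode-Hard absolute change cannot be recovered the same way, because the baseline completion rate is not given on this page.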

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the feedback mechanism holds, entire multi-agent pipelines could be tuned with minimal human input.
  • Hybrid extensions might combine textual feedback with numerical gradients in existing ML libraries.
  • Limits may appear when scaling to graphs with hundreds of interdependent components.
  • The method could apply to non-AI domains where structured variables admit natural language descriptions.

Load-bearing premise

LLM-generated natural language feedback is sufficiently general, consistent, and actionable to drive reliable optimization across domains without domain-specific prompt engineering or component tuning.

What would settle it

Apply TextGrad unchanged to a new domain such as quantum circuit design and measure whether performance gains exceed those from manual prompting baselines.

Original abstract

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces TextGrad, a framework for automatic differentiation via text that backpropagates natural-language feedback generated by LLMs to optimize individual components (e.g., code, prompts, molecular structures) within compound AI systems. Following PyTorch-like syntax, it claims to require only an objective function from the user and no prompt or component tuning, with empirical gains reported on GPQA (51% to 55% zero-shot GPT-4o accuracy), LeetCode-Hard (20% relative improvement), prompt optimization for reasoning, in silico molecule design, and radiotherapy treatment planning.

Significance. If the reported gains prove robust and reproducible, TextGrad would represent a significant step toward general, turn-key optimization methods for multi-component AI systems, analogous to backpropagation's role in neural networks. The cross-domain demonstrations (coding, QA, molecular design, medical planning) without domain-specific engineering support the claimed generality and could accelerate development of orchestrated LLM systems.

major comments (3)
  1. [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.
  2. [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.
  3. [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.
minor comments (1)
  1. [Framework] Notation: the analogy to PyTorch is helpful but the precise mapping from textual feedback to variable updates could be clarified with a small pseudocode example in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the recognition of TextGrad's potential impact and the specific concerns raised about experimental robustness, methodological assumptions, and implementation transparency. We address each major comment below and will revise the manuscript to strengthen these aspects.

Point-by-point responses
  1. Referee: [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.

    Authors: We agree that error bars, ablations, and explicit baseline details are important for substantiating the reliability claims. In the revised manuscript, we will add error bars computed over at least five independent runs with different random seeds for all reported results. We will include ablations varying the feedback LLM (e.g., GPT-4o, GPT-3.5-turbo, Claude-3), temperature settings (0.0, 0.5, 1.0), and system-prompt variants. We will also expand the experimental section with precise descriptions of baseline construction, including any prompts or procedures used for comparison methods, to demonstrate that improvements hold under the out-of-the-box setting. revision: yes

  2. Referee: [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.

    Authors: We acknowledge that the current manuscript lacks a formal theoretical analysis of feedback propagation. The design is motivated by the empirical analogy to backpropagation, and we demonstrate successful optimization across four heterogeneous domains (reasoning, coding, molecular design, and treatment planning) where variables differ substantially in structure. In revision, we will add a new subsection discussing the core assumptions, including when LLM feedback may fail to be actionable, and we will include observed counterexamples or failure modes from our development process to better delineate the method's scope and limitations. revision: partial

  3. Referee: [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.

    Authors: We will revise the implementation and experimental sections to clarify that the framework relies on a small set of fixed, general-purpose prompts for feedback generation that are not tuned per task. To address sensitivity concerns, we will report variance measurements across different feedback LLMs and minor prompt variations. We will also include the exact prompt templates in the supplementary material and open-source code release, enabling readers to assess and reproduce the sensitivity of results to these choices. revision: yes
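The ablation grid and error bars promised in the first response could be organized along the following lines. This is a sketch under stated assumptions: `run_once` is a hypothetical placeholder that returns synthetic numbers (not results from the paper or from any model), and the model names are used only as labels for the three feedback LLMs the authors mention.

```python
import itertools
import random
import statistics

# Grid taken from the rebuttal: three feedback LLMs, three temperatures, five seeds.
FEEDBACK_LLMS = ["gpt-4o", "gpt-3.5-turbo", "claude-3"]
TEMPERATURES = [0.0, 0.5, 1.0]
SEEDS = [0, 1, 2, 3, 4]


def run_once(feedback_llm: str, temperature: float, seed: int) -> float:
    """Hypothetical placeholder: returns a SYNTHETIC accuracy, not a real result.

    A real study would run the full optimization pipeline here.
    """
    rng = random.Random(f"{feedback_llm}-{temperature}-{seed}")
    return 0.5 + rng.uniform(-0.05, 0.05)


def summarize(scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


results = {}
for llm, temp in itertools.product(FEEDBACK_LLMS, TEMPERATURES):
    scores = [run_once(llm, temp, seed) for seed in SEEDS]
    results[(llm, temp)] = summarize(scores)

for (llm, temp), (mean, std) in results.items():
    print(f"{llm:14s} T={temp:.1f}  acc = {mean:.3f} ± {std:.3f}  (n={len(SEEDS)})")
```

Reporting the spread per configuration, rather than one pooled number, is what would let a reader see whether the gains depend on a particular feedback LLM or temperature.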

Circularity Check

0 steps flagged

No circularity in TextGrad framework claims or results

Full rationale

The paper presents TextGrad as a new textual backpropagation framework that uses LLM-generated natural language feedback to optimize components in compound AI systems. Claims rest on empirical demonstrations (e.g., accuracy gains on GPQA and LeetCode) rather than any mathematical derivation chain, fitted parameters renamed as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results is used to establish the core method. The framework is described as following PyTorch syntax with out-of-the-box applicability, supported by reported experimental outcomes across domains. This is self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can supply general, useful textual feedback for optimization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption LLMs can provide rich, general, natural language suggestions that improve variables in computation graphs
    This premise is required for the backpropagation-via-text mechanism to function without additional tuning.

pith-pipeline@v0.9.0 · 5605 in / 1230 out tokens · 47459 ms · 2026-05-13T11:23:28.588133+00:00 · methodology

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  2. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  3. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  4. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

    cs.AI 2026-05 conditional novelty 7.0

    Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...

  5. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  6. RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

    cs.LG 2026-04 unverdicted novelty 7.0

    RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...

  7. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  8. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  9. PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

    cs.AI 2026-05 unverdicted novelty 6.0

    PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

  10. MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

  11. AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

    cs.LG 2026-05 unverdicted novelty 6.0

    AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.

  12. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  13. Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.

  14. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  15. ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

    cs.AI 2026-04 unverdicted novelty 6.0

    ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.

  16. SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...

  17. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  18. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  19. Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions

    cs.MA 2026-04 unverdicted novelty 6.0

    Cognitive Fabric Nodes middleware improves multi-agent LLM system performance by over 10% on HotPotQA and MuSiQue datasets by elevating memory to an active substrate for topology selection, semantic grounding, securit...

  20. Reflective Context Learning: Studying the Optimization Primitives of Context Space

    cs.LG 2026-04 unverdicted novelty 6.0

    Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...

  21. Self-Optimizing Multi-Agent Systems for Deep Research

    cs.IR 2026-04 unverdicted novelty 6.0

    Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

  22. Memory in the Age of AI Agents

    cs.CL 2025-12 unverdicted novelty 6.0

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  23. Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

    cs.CL 2026-05 unverdicted novelty 5.0

    LANCE applies variational inference for label enhancement across multiple rejection categories, supplying gradients to a refinement model that produces safe, non-rigid responses from LLMs.

  24. Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

    cs.CL 2026-04 unverdicted novelty 5.0

    AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.

  25. Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt

    cs.CL 2026-04 unverdicted novelty 4.0

    A small language model resolves semantic risks and conflicts in prompts via multi-perspective consistency checks, yielding a 2.5-point gain in LLM reasoning performance at $0.02 cost.

  26. Supplement Generation Training for Enhancing Agentic Task Performance

    cs.LG 2026-04 unverdicted novelty 4.0

    SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

  27. Statistical Software Engineering with Tuned Variables

    cs.SE 2026-04 unverdicted novelty 4.0

    AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.

  28. Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

    cs.CL 2026-04 unverdicted novelty 3.0

    Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 27 Pith papers · 7 internal anchors

  1. [1]

    D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  2. [2]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. [3]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  4. [4]

    Llama 3 Model Card

    AI@Meta. Llama 3 Model Card. https://github.com/meta- llama/llama3/blob/main/MODEL_ CARD.md (2024)

  5. [5]

    The Claude 3 Model Family: Opus, Sonnet, Haiku

    Anthropic, A. The Claude 3 Model Family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024)

  6. [6]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  7. [7]

    H., Wu, Y., Le, Q

    Trinh, T. H., Wu, Y., Le, Q. V ., He, H. & Luong, T. Solving olympiad geometry without human demon- strations. Nature 625, 476–482 (2024)

  8. [8]

    Competition-level code generation with alphacode

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science 378, 1092–1097 (2022)

  9. [9]

    E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K

    Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. & Press, O. SWE-agent: Agent- Computer Interfaces Enable Automated Software Engineering 2024

  10. [10]

    V ., Haq, S., Sharma, A., Joshi, T

    Khattab, O., Singhvi, A., Maheshwari, P ., Zhang, Z., Santhanam, K., A, S. V ., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M. & Potts, C.DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=sY5N0zY5Od

  11. [11]

    Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N

    Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N. & Ghodsi, A. The Shift from Models to Compound AI Systems https://bair.berkeley.edu/ blog/2024/02/18/compound-ai-systems/. 2024

  12. [12]

    I., Han, Z., Paster, K., Pitis, S., Chan, H

    Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H. & Ba, J. Large Language Models are Human-Level Prompt Engineers in The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=92gvk82DE-

  13. [13]

    & Hinton, G

    Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)

  14. [14]

    Highly accurate protein structure prediction with AlphaFold

    Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)

  15. [15]

    J., Schrittwieser, J., Swirszcz, G., et al

    Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022)

  16. [16]

    J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al

    Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learn- ing. Nature 618, 257–263 (2023)

  17. [17]

    S., Aykol, M., Cheon, G

    Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023)

  18. [18]

    & Courville, A

    Goodfellow, I., Bengio, Y. & Courville, A. Deep learning (MIT press, 2016)

  19. [19]

    Differentiation

    Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986). 18 Automatic “Differentiation” via Text

  20. [20]

    & Darrell, T

    Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S. & Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014)

  21. [21]

    & Bengio, Y

    Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P ., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. & Bengio, Y. Theano: A CPU and GPU Math Expression Compiler in Proceedings of the Python for Scientific Computing Conference (SciPy) (2010)

  22. [22]

    TensorFlow: A System for Large-Scale Machine Learning in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), 265–283

    Abadi, M., Barham, P ., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A System for Large-Scale Machine Learning in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), 265–283

  23. [23]

    Pytorch: An imperative style, high-performance deep learning library

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  24. [24]

    & Mariéthoz, J

    Collobert, R., Bengio, S. & Mariéthoz, J. Torch: a modular machine learning software library (2002)

  25. [25]

    Gradient Descent

    Pryzant, R., Iter, D., Li, J., Lee, Y., Zhu, C. & Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) (Association for Computational Linguistics, Singa- pore, Dec. 2023), 7957–7968. https://aclantholog...

  26. [26]

    & Yao, S

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with ver- bal reinforcement learning in Advances in Neural Information Processing Systems 36 (2023). https : / / proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90- Paper-Conference.pdf

  27. [27]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022 (2023)

  28. [28]

    & Hashimoto, T

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P . & Hashimoto, T. B.Alpacae- val: An automatic evaluator of instruction-following models 2023

  29. [29]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T.,et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)

  30. [30]

    Self-refine: Iterative refinement with self-feedback

    Madaan, A., Tandon, N., Gupta, P ., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhu- moye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Infor- mation Processing Systems 36 (2024)

  31. [31]

    & Christiano, P

    Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. & Christiano, P . F. Learning to summarize with human feedback.Advances in Neural Information Processing Systems 33, 3008–3021 (2020)

  32. [32]

    Self-Rewarding Language Models

    Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J. & Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020 (2024)

  33. [33]

    X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P

    Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P . S. & Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems 36 (2024)

  34. [34]

    Large-scale machine learning with stochastic gradient descent

    Bottou, L. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT’2010, 177–186 (2010)

  35. [35]

    Boyd, S., Boyd, S. P . & Vandenberghe, L. Convex optimization (Cambridge university press, 2004)

  36. [36]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P ., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)

  37. [37]

    Finetuned Language Models are Zero-Shot Learners

    Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M. & Le, Q. V. Finetuned Language Models are Zero-Shot Learners in International Conference on Learning Representations (2022). https://openreview.net/forum?id=gEZrGCozdqR

  38. [38]

    Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

    Yuksekgonul, M., Chandrasekaran, V., Jones, E., Gunasekar, S., Naik, R., Palangi, H., Kamar, E. & Nushi, B. Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=gfFVATffPd

  39. [39]

    KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

    Abdin, M. I., Gunasekar, S., Chandrasekaran, V., Li, J., Yuksekgonul, M., Peshawaria, R. G., Naik, R. & Nushi, B. KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=b3kDP3IytM

  40. [40]

    Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4, 1–17 (1964)

  41. [41]

    On the importance of initialization and momentum in deep learning

    Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning in International conference on machine learning (2013), 1139–1147

  42. [42]

    Test-Time Training with Self-Supervision for Generalization under Distribution Shifts

    Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. & Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts in Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020). https://proceedings.mlr.press/v119/sun20b.html

  43. [43]

    Learning to (learn at test time)

    Sun, Y., Li, X., Dalal, K., Hsu, C., Koyejo, S., Guestrin, C., Wang, X., Hashimoto, T. & Chen, X. Learning to (learn at test time). arXiv preprint arXiv:2310.13807 (2023)

  44. [44]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R. & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models in The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=WE_vluYUL-X

  45. [45]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring Massive Multitask Language Understanding in International Conference on Learning Representations (2021). https://openreview.net/forum?id=d7KBjmI3GmQ

  46. [46]

    Large language models are zero-shot reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

  48. [48]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. 2024. https://openai.com/index/hello-gpt-4o/ (accessed 2024-05-18)

  49. [49]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 1–35 (2023)

  50. [50]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D. & Wei, J. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them in Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, Toronto, Canada, July 2023). https://aclantho...

  51. [51]

    Srivastava, A. et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. ISSN: 2835-8856. https://openreview.net/forum?id=uyTL5Bvosj (2023)

  52. [52]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C. & Schulman, J. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021)

  53. [53]

    Nicolaou, C. A. & Brown, N. Multi-objective optimization methods in drug design. Drug Discovery Today: Technologies 10, e427–e435 (2013)

  54. [54]

    Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Molecular oncology 6, 155–176 (2012)

  55. [55]

    Docking and virtual screening in drug discovery

    Kontoyianni, M. Docking and virtual screening in drug discovery. Proteomics for drug discovery: Methods and protocols, 255–266 (2017)

  56. [56]

    An overview of molecular docking

    Agarwal, S. & Mehrotra, R. An overview of molecular docking. JSM chem 4, 1024–1028 (2016)

  57. [57]

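    AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading

    Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry 31, 455–461 (2010)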

  58. [58]

    Understanding drug-likeness

    Ursu, O., Rayan, A., Goldblum, A. & Oprea, T. I. Understanding drug-likeness. Wiley Interdisciplinary Reviews: Computational Molecular Science 1, 760–781 (2011)

  59. [59]

    Quantifying the chemical beauty of drugs

    Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature chemistry 4, 90–98 (2012)

  60. [60]

    A practical guide to large-scale docking

    Bender, B. J., Gahbauer, S., Luttens, A., Lyu, J., Webb, C. M., Stein, R. M., Fink, E. A., Balius, T. E., Carlsson, J., Irwin, J. J., et al. A practical guide to large-scale docking. Nature protocols 16, 4799–4832 (2021)

  61. [61]

    DOCKSTRING: easy molecular docking yields better benchmarks for ligand design

    García-Ortegón, M., Simm, G. N., Tripp, A. J., Hernández-Lobato, J. M., Bender, A. & Bacallado, S. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. Journal of chemical information and modeling 62, 3486–3502 (2022)

  62. [62]

    Khan’s Treatment Planning in Radiation Oncology

    Khan, F. M., Sperduto, P. W. & Gibbons, J. P. Khan’s Treatment Planning in Radiation Oncology (Lippincott Williams & Wilkins, 2021)

  63. [63]

    The physical basis of IMRT and inverse planning

    Webb, S. The physical basis of IMRT and inverse planning. The British journal of radiology 76, 678–689 (2003)

  64. [64]

    Hussein, M., Heijmen, B. J. M., Verellen, D. & Nisbet, A. Automation in Intensity Modulated Radiotherapy Treatment Planning—a Review of Recent Innovations. British Journal of Radiology 91, 20180270. ISSN: 0007-1285 (Dec. 2018)

  65. [65]

    Development of the open-source dose calculation and optimization toolkit matRad

    Wieser, H.-P., Cisternas, E., Wahl, N., Ulrich, S., Stadler, A., Mescher, H., Müller, L.-R., Klinge, T., Gabrys, H., Burigo, L., et al. Development of the open-source dose calculation and optimization toolkit matRad. Medical Physics 44, 2556–2568 (2017)

  66. [66]

    Can generalist foundation models outcompete special-purpose tuning? case study in medicine

    Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452 (2023)

  67. [67]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Online, Nov. 2020), 4222–4235. https://aclanthology.org/2020...

  68. [68]

    Visual prompt tuning

    Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. Visual prompt tuning in European Conference on Computer Vision (2022), 709–727

  69. [69]

    Li, X. L. & Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Association for Computational Linguistics, Online, Aug. 2021), 4582–4597. https://ac...

  70. [70]

    Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction

    Chen, X., Zhang, N., Xie, X., Deng, S., Yao, Y., Tan, C., Huang, F., Si, L. & Chen, H. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction in Proceedings of the ACM Web conference 2022 (2022), 2778–2788

  71. [71]

    Prompt engineering a prompt engineer

    Ye, Q., Axmed, M., Pryzant, R. & Khani, F. Prompt engineering a prompt engineer. arXiv preprint arXiv:2311.05661 (2023)

  72. [72]

    Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP

    Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C. & Zaharia, M. Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022)

  73. [73]

    DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

    Singhvi, A., Shetty, M., Tan, S., Potts, C., Sen, K., Zaharia, M. & Khattab, O. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv preprint arXiv:2312.13382 (2023)

  74. [74]

    Large Language Models as Optimizers

    Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. Large Language Models as Optimizers in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=Bb4VGOWELI

  75. [75]

    Position: Leverage Foundational Models for Black-Box Optimization

    Song, X., Tian, Y., Lange, R. T., Lee, C., Tang, Y. & Chen, Y. Position: Leverage Foundational Models for Black-Box Optimization. arXiv: 2405.03547 [cs.LG] (2024)

  76. [76]

    Large Language Models to Enhance Bayesian Optimization

    Liu, T., Astorga, N., Seedat, N. & van der Schaar, M. Large Language Models to Enhance Bayesian Optimization in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=OOxotBmGol

  77. [77]

    Hypothesis Search: Inductive Reasoning with Language Models

    Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N. & Goodman, N. Hypothesis Search: Inductive Reasoning with Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=G7UtIGQmjm

  78. [78]

    RARR: Researching and Revising What Language Models Say, Using Language Models

    Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V., Lao, N., Lee, H., Juan, D.-C. & Guu, K. RARR: Researching and Revising What Language Models Say, Using Language Models in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguisti...

  79. [79]

    Teaching Large Language Models to Self-Debug

    Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching Large Language Models to Self-Debug in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=KuPixIqPiq

  80. [80]

    Learning Performance-Improving Code Edits

    Shypula, A. G., Madaan, A., Zeng, Y., Alon, U., Gardner, J. R., Yang, Y., Hashemi, M., Neubig, G., Ranganathan, P., Bastani, O. & Yazdanbakhsh, A. Learning Performance-Improving Code Edits in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=ix7rLVHXyY

Showing first 80 references.