TextGrad: Automatic "Differentiation" via Text
Pith reviewed 2026-05-13 11:23 UTC · model grok-4.3
The pith
TextGrad backpropagates LLM textual feedback to optimize individual components in compound AI systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system, following PyTorch syntax and abstraction, and works out-of-the-box across tasks from question answering and coding to molecule design and radiotherapy planning.
What carries the argument
The TextGrad framework, which uses LLMs to generate natural language suggestions that serve as gradients for optimizing variables in a computation graph.
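The mechanism can be sketched in a few lines of Python. This is an illustrative mock, not the real textgrad API: the two LLM calls (a critic that writes the textual "gradient" and an editor that applies it) are deterministic stubs so the sketch runs, and all names here (`critic_llm`, `editor_llm`, `Variable`, `backward`, `step`) are hypothetical stand-ins for the paper's abstractions.

```python
def critic_llm(value: str, downstream_feedback: str) -> str:
    """Stub for the LLM that writes textual feedback (the 'gradient')."""
    return f"To address '{downstream_feedback}', make '{value}' more specific."

def editor_llm(value: str, feedback: str) -> str:
    """Stub for the LLM that applies feedback (the optimizer step)."""
    return f"{value} [revised per: {feedback}]"

class Variable:
    """A node in the computation graph holding text instead of a tensor."""
    def __init__(self, value, requires_grad=True):
        self.value = value
        self.requires_grad = requires_grad
        self.grads = []          # textual feedback collected during backward
        self.predecessors = []   # upstream Variables in the graph

def backward(loss: "Variable", objective: str):
    """Propagate textual feedback from the loss back through the graph,
    the analogue of the chain rule in ordinary backpropagation."""
    frontier = [(loss, objective)]
    while frontier:
        node, downstream = frontier.pop()
        for pred in node.predecessors:
            fb = critic_llm(pred.value, downstream)
            pred.grads.append(fb)
            frontier.append((pred, fb))

def step(variables):
    """Optimizer step: rewrite each trainable variable using its feedback."""
    for v in variables:
        if v.requires_grad and v.grads:
            v.value = editor_llm(v.value, v.grads[-1])
            v.grads.clear()

# Toy two-node chain: prompt -> answer -> loss
prompt = Variable("Solve the problem.")
answer = Variable("42", requires_grad=False)
answer.predecessors = [prompt]
loss = Variable("answer is unsupported", requires_grad=False)
loss.predecessors = [answer]

backward(loss, "the answer lacks justification")
step([prompt])  # only the prompt is trainable, mirroring prompt optimization
```

With real LLMs in place of the stubs, the same loop is what lets one optimization routine handle code, prompts, and molecule strings alike: the graph and the update rule never change, only the text flowing through them.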
If this is right
- Zero-shot accuracy of GPT-4o on Google-Proof Question Answering rises from 51% to 55%.
- 20% relative performance gain on LeetCode-Hard coding problem solutions.
- New druglike small molecules are designed with desirable in silico binding.
- Radiation oncology treatment plans are produced with high specificity.
- Reasoning prompts improve without any framework modifications.
Where Pith is reading between the lines
- If the feedback mechanism holds, entire multi-agent pipelines could be tuned with minimal human input.
- Hybrid extensions might combine textual feedback with numerical gradients in existing ML libraries.
- Limits may appear when scaling to graphs with hundreds of interdependent components.
- The method could apply to non-AI domains where structured variables admit natural language descriptions.
Load-bearing premise
LLM-generated natural language feedback is sufficiently general, consistent, and actionable to drive reliable optimization across domains without domain-specific prompt engineering or component tuning.
What would settle it
Apply TextGrad unchanged to a new domain such as quantum circuit design and measure whether performance gains exceed those from manual prompting baselines.
read the original abstract
AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days, until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy to use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TextGrad, a framework for automatic differentiation via text that backpropagates natural-language feedback generated by LLMs to optimize individual components (e.g., code, prompts, molecular structures) within compound AI systems. The framework follows PyTorch-like syntax and claims to require only an objective function from the user, with no prompt or component tuning. Empirical gains are reported on GPQA (51% to 55% zero-shot GPT-4o accuracy), LeetCode-Hard (20% relative improvement), prompt optimization for reasoning, in silico molecule design, and radiotherapy treatment planning.
Significance. If the reported gains prove robust and reproducible, TextGrad would represent a significant step toward general, turn-key optimization methods for multi-component AI systems, analogous to backpropagation's role in neural networks. The cross-domain demonstrations (coding, QA, molecular design, medical planning) without domain-specific engineering support the claimed generality and could accelerate development of orchestrated LLM systems.
major comments (3)
- [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.
- [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.
- [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.
minor comments (1)
- [Framework] Notation: the analogy to PyTorch is helpful but the precise mapping from textual feedback to variable updates could be clarified with a small pseudocode example in the main text.
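The mapping the referee asks for could be made concrete with a fragment like the following. This is a hypothetical sketch, not taken from the paper: `propose_edit` stands in for the LLM call that rewrites a variable, and is deterministic here so the sketch runs.

```python
def propose_edit(variable_text: str, feedback: str) -> str:
    """Stand-in for an LLM call that rewrites a variable given one piece
    of textual feedback. A real system would prompt an LLM here."""
    return f"{variable_text}\n# addressed: {feedback}"

def tgd_step(variable_text: str, textual_gradients: list[str]) -> str:
    """Textual 'gradient descent' step: fold each piece of feedback into
    the variable, in the order it was collected during the backward pass."""
    for feedback in textual_gradients:
        variable_text = propose_edit(variable_text, feedback)
    return variable_text

updated = tgd_step("def solve(x): return x", ["handle negative x"])
```

The point of the sketch is that the "update rule" is itself an LLM call parameterized by the accumulated feedback, which is the textual analogue of `param -= lr * param.grad`.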
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We appreciate the recognition of TextGrad's potential impact and the specific concerns raised about experimental robustness, methodological assumptions, and implementation transparency. We address each major comment below and will revise the manuscript to strengthen these aspects.
read point-by-point responses
-
Referee: [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.
Authors: We agree that error bars, ablations, and explicit baseline details are important for substantiating the reliability claims. In the revised manuscript, we will add error bars computed over at least five independent runs with different random seeds for all reported results. We will include ablations varying the feedback LLM (e.g., GPT-4o, GPT-3.5-turbo, Claude-3), temperature settings (0.0, 0.5, 1.0), and system-prompt variants. We will also expand the experimental section with precise descriptions of baseline construction, including any prompts or procedures used for comparison methods, to demonstrate that improvements hold under the out-of-the-box setting. revision: yes
-
Referee: [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.
Authors: We acknowledge that the current manuscript lacks a formal theoretical analysis of feedback propagation. The design is motivated by the empirical analogy to backpropagation, and we demonstrate successful optimization across four heterogeneous domains (reasoning, coding, molecular design, and treatment planning) where variables differ substantially in structure. In revision, we will add a new subsection discussing the core assumptions, including when LLM feedback may fail to be actionable, and we will include observed counterexamples or failure modes from our development process to better delineate the method's scope and limitations. revision: partial
-
Referee: [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.
Authors: We will revise the implementation and experimental sections to clarify that the framework relies on a small set of fixed, general-purpose prompts for feedback generation that are not tuned per task. To address sensitivity concerns, we will report variance measurements across different feedback LLMs and minor prompt variations. We will also include the exact prompt templates in the supplementary material and open-source code release, enabling readers to assess and reproduce the sensitivity of results to these choices. revision: yes
Circularity Check
No circularity in TextGrad framework claims or results
full rationale
The paper presents TextGrad as a new textual backpropagation framework that uses LLM-generated natural language feedback to optimize components in compound AI systems. Claims rest on empirical demonstrations (e.g., accuracy gains on GPQA and LeetCode) rather than any mathematical derivation chain, fitted parameters renamed as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results is used to establish the core method. The framework is described as following PyTorch syntax with out-of-the-box applicability, supported by reported experimental outcomes across domains. The evaluation is grounded in external benchmarks, with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can provide rich, general, natural language suggestions that improve variables in computation graphs.
Forward citations
Cited by 30 Pith papers
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
-
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
-
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions
Cognitive Fabric Nodes middleware improves multi-agent LLM system performance by over 10% on HotPotQA and MuSiQue datasets by elevating memory to an active substrate for topology selection, semantic grounding, securit...
-
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
UNBOX: Unveiling Black-box visual models with Natural-language
UNBOX recovers interpretable text concepts that maximally activate classes in black-box vision models by recasting activation maximization as semantic search with LLMs and diffusion models.
-
Memory in the Age of AI Agents
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
-
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
LANCE applies variational inference for label enhancement across multiple rejection categories, supplying gradients to a refinement model that produces safe, non-rigid responses from LLMs.
-
Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.
-
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
A small language model resolves semantic risks and conflicts in prompts via multi-perspective consistency checks, yielding a 2.5-point gain in LLM reasoning performance at $0.02 cost.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
Statistical Software Engineering with Tuned Variables
AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
Reference graph
Works this paper leans on
-
[1]
D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
work page 1901
-
[2]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
AI@Meta. Llama 3 Model Card. https://github.com/meta- llama/llama3/blob/main/MODEL_ CARD.md (2024)
work page 2024
-
[5]
The Claude 3 Model Family: Opus, Sonnet, Haiku
Anthropic, A. The Claude 3 Model Family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024)
work page 2024
-
[6]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Trinh, T. H., Wu, Y., Le, Q. V ., He, H. & Luong, T. Solving olympiad geometry without human demon- strations. Nature 625, 476–482 (2024)
work page 2024
-
[8]
Competition-level code generation with alphacode
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science 378, 1092–1097 (2022)
work page 2022
-
[9]
E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. & Press, O. SWE-agent: Agent- Computer Interfaces Enable Automated Software Engineering 2024
work page 2024
-
[10]
V ., Haq, S., Sharma, A., Joshi, T
Khattab, O., Singhvi, A., Maheshwari, P ., Zhang, Z., Santhanam, K., A, S. V ., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M. & Potts, C.DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=sY5N0zY5Od
work page 2024
-
[11]
Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N
Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N. & Ghodsi, A. The Shift from Models to Compound AI Systems https://bair.berkeley.edu/ blog/2024/02/18/compound-ai-systems/. 2024
work page 2024
-
[12]
I., Han, Z., Paster, K., Pitis, S., Chan, H
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H. & Ba, J. Large Language Models are Human-Level Prompt Engineers in The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=92gvk82DE-
work page 2023
-
[13]
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)
work page 2012
-
[14]
Highly accurate protein structure prediction with AlphaFold
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)
work page 2021
-
[15]
J., Schrittwieser, J., Swirszcz, G., et al
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022)
work page 2022
-
[16]
Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learn- ing. Nature 618, 257–263 (2023)
work page 2023
-
[17]
Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023)
work page 2023
-
[18]
Goodfellow, I., Bengio, Y. & Courville, A. Deep learning (MIT press, 2016)
work page 2016
-
[19]
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986). 18 Automatic “Differentiation” via Text
work page 1986
-
[20]
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S. & Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014)
-
[21]
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P ., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. & Bengio, Y. Theano: A CPU and GPU Math Expression Compiler in Proceedings of the Python for Scientific Computing Conference (SciPy) (2010)
work page 2010
-
[22]
Abadi, M., Barham, P ., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A System for Large-Scale Machine Learning in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), 265–283
work page 2016
-
[23]
Pytorch: An imperative style, high-performance deep learning library
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
work page 2019
-
[24]
Collobert, R., Bengio, S. & Mariéthoz, J. Torch: a modular machine learning software library (2002)
work page 2002
-
[25]
Pryzant, R., Iter, D., Li, J., Lee, Y., Zhu, C. & Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) (Association for Computational Linguistics, Singa- pore, Dec. 2023), 7957–7968. https://aclantholog...
work page 2023
-
[26]
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with ver- bal reinforcement learning in Advances in Neural Information Processing Systems 36 (2023). https : / / proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90- Paper-Conference.pdf
work page 2023
-
[27]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P . & Hashimoto, T. B.Alpacae- val: An automatic evaluator of instruction-following models 2023
work page 2023
-
[29]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T.,et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Self-refine: Iterative refinement with self-feedback
Madaan, A., Tandon, N., Gupta, P ., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhu- moye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Infor- mation Processing Systems 36 (2024)
work page 2024
-
[31]
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. & Christiano, P . F. Learning to summarize with human feedback.Advances in Neural Information Processing Systems 33, 3008–3021 (2020)
work page 2020
-
[32]
Self-Rewarding Language Models
Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J. & Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020 (2024)
work page internal anchor Pith review arXiv 2024
-
[33]
X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P
Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P . S. & Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems 36 (2024)
work page 2024
-
[34]
Large-scale machine learning with stochastic gradient descent
Bottou, L. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT’2010, 177–186 (2010)
work page 2010
-
[35]
Boyd, S., Boyd, S. P . & Vandenberghe, L. Convex optimization (Cambridge university press, 2004)
work page 2004
-
[36]
Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P ., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)
work page 2022
-
[37]
Wei, J., Bosma, M., Zhao, V ., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M. & Le, Q. V . Finetuned Language Models are Zero-Shot Learners in International Conference on Learning Representations (2022). https://openreview.net/forum?id=gEZrGCozdqR. 19 Automatic “Differentiation” via Text
work page 2022
-
[38]
Yuksekgonul, M., Chandrasekaran, V ., Jones, E., Gunasekar, S., Naik, R., Palangi, H., Kamar, E. & Nushi, B. Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum? id=gfFVATffPd
work page 2024
-
[39]
I., Gunasekar, S., Chandrasekaran, V ., Li, J., Yuksekgonul, M., Peshawaria, R
Abdin, M. I., Gunasekar, S., Chandrasekaran, V ., Li, J., Yuksekgonul, M., Peshawaria, R. G., Naik, R. & Nushi, B. KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval in The Twelfth International Conference on Learning Representations (2024). https : / / openreview . net / forum ? id = b3kDP3IytM
work page 2024
-
[40]
Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computa- tional Mathematics and Mathematical Physics 4, 1–17 (1964)
work page 1964
-
[41]
Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning in International conference on machine learning (2013), 1139–1147
work page 2013
-
[42]
Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. & Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts in Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020). https://proceedings.mlr.press/v119/sun20b.html
work page 2020
-
[43]
Learning to (learn at test time)
Sun, Y., Li, X., Dalal, K., Hsu, C., Koyejo, S., Guestrin, C., Wang, X., Hashimoto, T. & Chen, X. Learning to (learn at test time). arXiv preprint arXiv:2310.13807 (2023)
-
[44]
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R. & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Modelsin The Eleventh International Conference on Learning Representations(2023). https://openreview.net/forum?id=WE_vluYUL-X
work page 2023
-
[45]
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring Massive Multitask Language Understanding in International Conference on Learning Representations(2021). https: //openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[46]
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reason- ers. Advances in neural information processing systems 35, 22199–22213 (2022)
work page 2022
-
[47]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V ., Zhou, D., et al. Chain-of- thought prompting elicits reasoning in large language models. Advances in neural information process- ing systems 35, 24824–24837 (2022)
work page 2022
-
[48]
Hello GPT-4o Accessed: 2024-05-18
OpenAI. Hello GPT-4o Accessed: 2024-05-18. 2024. https://openai.com/index/hello-gpt-4o/
work page 2024
-
[49]
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 1–35 (2023)
[50]
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D. & Wei, J. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them in Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, Toronto, Canada, July 2023). https://aclantho...
[51]
Srivastava, A. et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. ISSN: 2835-8856. https://openreview.net/forum?id=uyTL5Bvosj (2023)
[52]
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C. & Schulman, J. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021)
[53]
Nicolaou, C. A. & Brown, N. Multi-objective optimization methods in drug design. Drug Discovery Today: Technologies 10, e427–e435 (2013)
[54]
Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Molecular oncology 6, 155–176 (2012)
[55]
Kontoyianni, M. Docking and virtual screening in drug discovery. Proteomics for drug discovery: Methods and protocols, 255–266 (2017)
[56]
Agarwal, S. & Mehrotra, R. An overview of molecular docking. JSM chem 4, 1024–1028 (2016)
[57]
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry 31, 455–461 (2010)
[58]
Ursu, O., Rayan, A., Goldblum, A. & Oprea, T. I. Understanding drug-likeness. Wiley Interdisciplinary Reviews: Computational Molecular Science 1, 760–781 (2011)
[59]
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature chemistry 4, 90–98 (2012)
[60]
Bender, B. J., Gahbauer, S., Luttens, A., Lyu, J., Webb, C. M., Stein, R. M., Fink, E. A., Balius, T. E., Carlsson, J., Irwin, J. J., et al. A practical guide to large-scale docking. Nature protocols 16, 4799–4832 (2021)
[61]
García-Ortegón, M., Simm, G. N., Tripp, A. J., Hernández-Lobato, J. M., Bender, A. & Bacallado, S. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. Journal of chemical information and modeling 62, 3486–3502 (2022)
[62]
Khan, F. M., Sperduto, P. W. & Gibbons, J. P. Khan's Treatment Planning in Radiation Oncology (Lippincott Williams & Wilkins, 2021)
[63]
Webb, S. The physical basis of IMRT and inverse planning. The British journal of radiology 76, 678–689 (2003)
[64]
Hussein, M., Heijmen, B. J. M., Verellen, D. & Nisbet, A. Automation in Intensity Modulated Radiotherapy Treatment Planning: a Review of Recent Innovations. British Journal of Radiology 91, 20180270. ISSN: 0007-1285 (Dec. 2018)
[65]
Wieser, H.-P., Cisternas, E., Wahl, N., Ulrich, S., Stadler, A., Mescher, H., Müller, L.-R., Klinge, T., Gabrys, H., Burigo, L., et al. Development of the open-source dose calculation and optimization toolkit matRad. Medical Physics 44, 2556–2568 (2017)
[66]
Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452 (2023)
[67]
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Online, Nov. 2020), 4222–4235. https://aclanthology.org/2020...
[68]
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. Visual prompt tuning in European Conference on Computer Vision (2022), 709–727
[69]
Li, X. L. & Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Association for Computational Linguistics, Online, Aug. 2021), 4582–4597. https://ac...
[70]
[71]
Ye, Q., Axmed, M., Pryzant, R. & Khani, F. Prompt engineering a prompt engineer. arXiv preprint arXiv:2311.05661 (2023)
[72]
Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C. & Zaharia, M. Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022)
[73]
Singhvi, A., Shetty, M., Tan, S., Potts, C., Sen, K., Zaharia, M. & Khattab, O. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv preprint arXiv:2312.13382 (2023)
[74]
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. Large Language Models as Optimizers in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=Bb4VGOWELI
[75]
Song, X., Tian, Y., Lange, R. T., Lee, C., Tang, Y. & Chen, Y. Position: Leverage Foundational Models for Black-Box Optimization. 2024. arXiv: 2405.03547 [cs.LG]
[76]
Liu, T., Astorga, N., Seedat, N. & van der Schaar, M. Large Language Models to Enhance Bayesian Optimization in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=OOxotBmGol
[77]
Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N. & Goodman, N. Hypothesis Search: Inductive Reasoning with Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=G7UtIGQmjm
[78]
Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V., Lao, N., Lee, H., Juan, D.-C. & Guu, K. RARR: Researching and Revising What Language Models Say, Using Language Models in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguisti...
[79]
[80]
Shypula, A. G., Madaan, A., Zeng, Y., Alon, U., Gardner, J. R., Yang, Y., Hashemi, M., Neubig, G., Ranganathan, P., Bastani, O. & Yazdanbakhsh, A. Learning Performance-Improving Code Edits in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=ix7rLVHXyY