TextGrad: Automatic "Differentiation" via Text
Pith reviewed 2026-05-13 11:23 UTC · model grok-4.3
The pith
TextGrad backpropagates LLM textual feedback to optimize individual components in compound AI systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system, following PyTorch syntax and abstraction, and works out-of-the-box across tasks from question answering and coding to molecule design and radiotherapy planning.
What carries the argument
The TextGrad framework, which uses LLMs to generate natural language suggestions that serve as gradients for optimizing variables in a computation graph.
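The mechanism can be sketched in a few lines of Python. This is an illustrative mock, not the real textgrad API: the two LLM calls (a critic that writes the textual "gradient" and an editor that applies it) are deterministic stubs so the sketch runs, and all names here (`critic_llm`, `editor_llm`, `Variable`, `backward`, `step`) are hypothetical stand-ins for the paper's abstractions.

```python
def critic_llm(value: str, downstream_feedback: str) -> str:
    """Stub for the LLM that writes textual feedback (the 'gradient')."""
    return f"To address '{downstream_feedback}', make '{value}' more specific."

def editor_llm(value: str, feedback: str) -> str:
    """Stub for the LLM that applies feedback (the optimizer step)."""
    return f"{value} [revised per: {feedback}]"

class Variable:
    """A node in the computation graph holding text instead of a tensor."""
    def __init__(self, value, requires_grad=True):
        self.value = value
        self.requires_grad = requires_grad
        self.grads = []          # textual feedback collected during backward
        self.predecessors = []   # upstream Variables in the graph

def backward(loss: "Variable", objective: str):
    """Propagate textual feedback from the loss back through the graph,
    the analogue of the chain rule in ordinary backpropagation."""
    frontier = [(loss, objective)]
    while frontier:
        node, downstream = frontier.pop()
        for pred in node.predecessors:
            fb = critic_llm(pred.value, downstream)
            pred.grads.append(fb)
            frontier.append((pred, fb))

def step(variables):
    """Optimizer step: rewrite each trainable variable using its feedback."""
    for v in variables:
        if v.requires_grad and v.grads:
            v.value = editor_llm(v.value, v.grads[-1])
            v.grads.clear()

# Toy two-node chain: prompt -> answer -> loss
prompt = Variable("Solve the problem.")
answer = Variable("42", requires_grad=False)
answer.predecessors = [prompt]
loss = Variable("answer is unsupported", requires_grad=False)
loss.predecessors = [answer]

backward(loss, "the answer lacks justification")
step([prompt])  # only the prompt is trainable, mirroring prompt optimization
```

With real LLMs in place of the stubs, the same loop is what lets one optimization routine handle code, prompts, and molecule strings alike: the graph and the update rule never change, only the text flowing through them.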
If this is right
- Zero-shot accuracy of GPT-4o on Google-Proof Question Answering rises from 51% to 55%.
- 20% relative performance gain on LeetCode-Hard coding problem solutions.
- New druglike small molecules are designed with desirable in silico binding.
- Radiation oncology treatment plans are produced with high specificity.
- Reasoning prompts improve without any framework modifications.
Where Pith is reading between the lines
- If the feedback mechanism holds, entire multi-agent pipelines could be tuned with minimal human input.
- Hybrid extensions might combine textual feedback with numerical gradients in existing ML libraries.
- Limits may appear when scaling to graphs with hundreds of interdependent components.
- The method could apply to non-AI domains where structured variables admit natural language descriptions.
Load-bearing premise
LLM-generated natural language feedback is sufficiently general, consistent, and actionable to drive reliable optimization across domains without domain-specific prompt engineering or component tuning.
What would settle it
Apply TextGrad unchanged to a new domain such as quantum circuit design and measure whether performance gains exceed those from manual prompting baselines.
read the original abstract
AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days, until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy to use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TextGrad, a framework for automatic differentiation via text that backpropagates natural-language feedback generated by LLMs to optimize individual components (e.g., code, prompts, molecular structures) within compound AI systems. The framework follows PyTorch-like syntax and claims to require only an objective function from the user, with no prompt or component tuning. Empirical gains are reported on GPQA (51% to 55% zero-shot GPT-4o accuracy), LeetCode-Hard (20% relative improvement), prompt optimization for reasoning, in silico molecule design, and radiotherapy treatment planning.
Significance. If the reported gains prove robust and reproducible, TextGrad would represent a significant step toward general, turn-key optimization methods for multi-component AI systems, analogous to backpropagation's role in neural networks. The cross-domain demonstrations (coding, QA, molecular design, medical planning) without domain-specific engineering support the claimed generality and could accelerate development of orchestrated LLM systems.
major comments (3)
- [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.
- [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.
- [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.
minor comments (1)
- [Framework] Notation: the analogy to PyTorch is helpful but the precise mapping from textual feedback to variable updates could be clarified with a small pseudocode example in the main text.
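The mapping the referee asks for could be made concrete with a fragment like the following. This is a hypothetical sketch, not taken from the paper: `propose_edit` stands in for the LLM call that rewrites a variable, and is deterministic here so the sketch runs.

```python
def propose_edit(variable_text: str, feedback: str) -> str:
    """Stand-in for an LLM call that rewrites a variable given one piece
    of textual feedback. A real system would prompt an LLM here."""
    return f"{variable_text}\n# addressed: {feedback}"

def tgd_step(variable_text: str, textual_gradients: list[str]) -> str:
    """Textual 'gradient descent' step: fold each piece of feedback into
    the variable, in the order it was collected during the backward pass."""
    for feedback in textual_gradients:
        variable_text = propose_edit(variable_text, feedback)
    return variable_text

updated = tgd_step("def solve(x): return x", ["handle negative x"])
```

The point of the sketch is that the "update rule" is itself an LLM call parameterized by the accumulated feedback, which is the textual analogue of `param -= lr * param.grad`.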
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We appreciate the recognition of TextGrad's potential impact and the specific concerns raised about experimental robustness, methodological assumptions, and implementation transparency. We address each major comment below and will revise the manuscript to strengthen these aspects.
read point-by-point responses
-
Referee: [Experiments] Experiments (results on GPQA, LeetCode-Hard, etc.): concrete performance gains are reported without error bars, ablation studies on feedback-LLM choice, temperature, or system-prompt variants, and without explicit baseline-construction details. This directly weakens the central claim of reliable, out-of-the-box optimization, as LLM feedback is known to be stochastic and prompt-sensitive.
Authors: We agree that error bars, ablations, and explicit baseline details are important for substantiating the reliability claims. In the revised manuscript, we will add error bars computed over at least five independent runs with different random seeds for all reported results. We will include ablations varying the feedback LLM (e.g., GPT-4o, GPT-3.5-turbo, Claude-3), temperature settings (0.0, 0.5, 1.0), and system-prompt variants. We will also expand the experimental section with precise descriptions of baseline construction, including any prompts or procedures used for comparison methods, to demonstrate that improvements hold under the out-of-the-box setting. revision: yes
-
Referee: [Methods] Framework and Methods: no formal argument, propagation analysis, or counterexample testing is supplied to show why textual feedback reliably traverses the computation graph for variables ranging from code to molecular structures. The weakest assumption (general, consistent, actionable LLM feedback without hidden tuning) therefore remains untested.
Authors: We acknowledge that the current manuscript lacks a formal theoretical analysis of feedback propagation. The design is motivated by the empirical analogy to backpropagation, and we demonstrate successful optimization across four heterogeneous domains (reasoning, coding, molecular design, and treatment planning) where variables differ substantially in structure. In revision, we will add a new subsection discussing the core assumptions, including when LLM feedback may fail to be actionable, and we will include observed counterexamples or failure modes from our development process to better delineate the method's scope and limitations. revision: partial
-
Referee: [Implementation] Implementation details: the assertion of zero prompt or component tuning is not accompanied by variance measurements or sensitivity analysis on the feedback-generation step, leaving open whether reported improvements depend on unstated choices of the feedback LLM or exact prompt templates.
Authors: We will revise the implementation and experimental sections to clarify that the framework relies on a small set of fixed, general-purpose prompts for feedback generation that are not tuned per task. To address sensitivity concerns, we will report variance measurements across different feedback LLMs and minor prompt variations. We will also include the exact prompt templates in the supplementary material and open-source code release, enabling readers to assess and reproduce the sensitivity of results to these choices. revision: yes
Circularity Check
No circularity in TextGrad framework claims or results
full rationale
The paper presents TextGrad as a new textual backpropagation framework that uses LLM-generated natural language feedback to optimize components in compound AI systems. Claims rest on empirical demonstrations (e.g., accuracy gains on GPQA and LeetCode) rather than any mathematical derivation chain, fitted parameters renamed as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results is used to establish the core method. The framework is described as following PyTorch syntax with out-of-the-box applicability, supported by reported experimental outcomes across domains. The evaluation is grounded in external benchmarks, with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can provide rich, general, natural language suggestions that improve variables in computation graphs.
Forward citations
Cited by 30 Pith papers
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
-
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
-
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions
Cognitive Fabric Nodes middleware improves multi-agent LLM system performance by over 10% on HotPotQA and MuSiQue datasets by elevating memory to an active substrate for topology selection, semantic grounding, securit...
-
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
UNBOX: Unveiling Black-box visual models with Natural-language
UNBOX recovers interpretable text concepts that maximally activate classes in black-box vision models by recasting activation maximization as semantic search with LLMs and diffusion models.
-
Memory in the Age of AI Agents
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
-
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
LANCE applies variational inference for label enhancement across multiple rejection categories, supplying gradients to a refinement model that produces safe, non-rigid responses from LLMs.
-
Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.
-
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
A small language model resolves semantic risks and conflicts in prompts via multi-perspective consistency checks, yielding a 2.5-point gain in LLM reasoning performance at $0.02 cost.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
Statistical Software Engineering with Tuned Variables
AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
Reference graph
Works this paper leans on
-
[1]
D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A.,et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
work page 1901
-
[2]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
AI@Meta. Llama 3 Model Card. https://github.com/meta- llama/llama3/blob/main/MODEL_ CARD.md (2024)
work page 2024
-
[5]
The Claude 3 Model Family: Opus, Sonnet, Haiku
Anthropic, A. The Claude 3 Model Family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024)
work page 2024
-
[6]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Trinh, T. H., Wu, Y., Le, Q. V ., He, H. & Luong, T. Solving olympiad geometry without human demon- strations. Nature 625, 476–482 (2024)
work page 2024
-
[8]
Competition-level code generation with alphacode
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science 378, 1092–1097 (2022)
work page 2022
-
[9]
E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. & Press, O. SWE-agent: Agent- Computer Interfaces Enable Automated Software Engineering 2024
work page 2024
-
[10]
V ., Haq, S., Sharma, A., Joshi, T
Khattab, O., Singhvi, A., Maheshwari, P ., Zhang, Z., Santhanam, K., A, S. V ., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M. & Potts, C.DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=sY5N0zY5Od
work page 2024
-
[11]
Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N
Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N. & Ghodsi, A. The Shift from Models to Compound AI Systems https://bair.berkeley.edu/ blog/2024/02/18/compound-ai-systems/. 2024
work page 2024
-
[12]
I., Han, Z., Paster, K., Pitis, S., Chan, H
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H. & Ba, J. Large Language Models are Human-Level Prompt Engineers in The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=92gvk82DE-
work page 2023
-
[13]
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)
work page 2012
-
[14]
Highly accurate protein structure prediction with AlphaFold
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)
work page 2021
-
[15]
J., Schrittwieser, J., Swirszcz, G., et al
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022)
work page 2022
-
[16]
Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learn- ing. Nature 618, 257–263 (2023)
work page 2023
-
[17]
Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023)
work page 2023
-
[18]
Goodfellow, I., Bengio, Y. & Courville, A. Deep learning (MIT press, 2016)
work page 2016
-
[19]
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986). 18 Automatic “Differentiation” via Text
work page 1986
-
[20]
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S. & Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014)
-
[21]
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P ., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. & Bengio, Y. Theano: A CPU and GPU Math Expression Compiler in Proceedings of the Python for Scientific Computing Conference (SciPy) (2010)
work page 2010
-
[22]
Abadi, M., Barham, P ., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A System for Large-Scale Machine Learning in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), 265–283
work page 2016
-
[23]
Pytorch: An imperative style, high-performance deep learning library
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
work page 2019
-
[24]
Collobert, R., Bengio, S. & Mariéthoz, J. Torch: a modular machine learning software library (2002)
work page 2002
-
[25]
Pryzant, R., Iter, D., Li, J., Lee, Y., Zhu, C. & Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) (Association for Computational Linguistics, Singa- pore, Dec. 2023), 7957–7968. https://aclantholog...
work page 2023
-
[26]
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with ver- bal reinforcement learning in Advances in Neural Information Processing Systems 36 (2023). https : / / proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90- Paper-Conference.pdf
work page 2023
-
[27]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P . & Hashimoto, T. B.Alpacae- val: An automatic evaluator of instruction-following models 2023
work page 2023
-
[29]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T.,et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Self-refine: Iterative refinement with self-feedback
Madaan, A., Tandon, N., Gupta, P ., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhu- moye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Infor- mation Processing Systems 36 (2024)
work page 2024
-
[31]
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. & Christiano, P . F. Learning to summarize with human feedback.Advances in Neural Information Processing Systems 33, 3008–3021 (2020)
work page 2020
-
[32]
Self-Rewarding Language Models
Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J. & Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020 (2024)
work page internal anchor Pith review arXiv 2024
-
[33]
X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P
Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P . S. & Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems 36 (2024)
work page 2024
-
[34]
Large-scale machine learning with stochastic gradient descent
Bottou, L. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT’2010, 177–186 (2010)
work page 2010
-
[35]
Boyd, S., Boyd, S. P . & Vandenberghe, L. Convex optimization (Cambridge university press, 2004)
work page 2004
-
[36]
Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P ., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)
work page 2022
-
[37]
Wei, J., Bosma, M., Zhao, V ., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M. & Le, Q. V . Finetuned Language Models are Zero-Shot Learners in International Conference on Learning Representations (2022). https://openreview.net/forum?id=gEZrGCozdqR. 19 Automatic “Differentiation” via Text
work page 2022
-
[38]
Yuksekgonul, M., Chandrasekaran, V ., Jones, E., Gunasekar, S., Naik, R., Palangi, H., Kamar, E. & Nushi, B. Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum? id=gfFVATffPd
work page 2024
-
[39]
I., Gunasekar, S., Chandrasekaran, V ., Li, J., Yuksekgonul, M., Peshawaria, R
Abdin, M. I., Gunasekar, S., Chandrasekaran, V ., Li, J., Yuksekgonul, M., Peshawaria, R. G., Naik, R. & Nushi, B. KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval in The Twelfth International Conference on Learning Representations (2024). https : / / openreview . net / forum ? id = b3kDP3IytM
work page 2024
-
[40]
Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computa- tional Mathematics and Mathematical Physics 4, 1–17 (1964)
work page 1964
-
[41]
Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning in International conference on machine learning (2013), 1139–1147
work page 2013
-
[42]
Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. & Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts in Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020). https://proceedings.mlr.press/v119/sun20b.html
work page 2020
-
[43]
Learning to (learn at test time)
Sun, Y., Li, X., Dalal, K., Hsu, C., Koyejo, S., Guestrin, C., Wang, X., Hashimoto, T. & Chen, X. Learning to (learn at test time). arXiv preprint arXiv:2310.13807 (2023)
-
[44]
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R. & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Modelsin The Eleventh International Conference on Learning Representations(2023). https://openreview.net/forum?id=WE_vluYUL-X
work page 2023
-
[45]
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring Massive Multitask Language Understanding in International Conference on Learning Representations(2021). https: //openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[46]
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reason- ers. Advances in neural information processing systems 35, 22199–22213 (2022)
work page 2022
-
[47]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V ., Zhou, D., et al. Chain-of- thought prompting elicits reasoning in large language models. Advances in neural information process- ing systems 35, 24824–24837 (2022)
work page 2022
-
[48]
Hello GPT-4o Accessed: 2024-05-18
OpenAI. Hello GPT-4o Accessed: 2024-05-18. 2024. https://openai.com/index/hello-gpt-4o/
work page 2024
-
[49]
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 1–35 (2023)
[50]
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D. & Wei, J. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them in Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, Toronto, Canada, July 2023). https://aclantho...
[51]
Srivastava, A. et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. ISSN: 2835-8856. https://openreview.net/forum?id=uyTL5Bvosj (2023)
[52]
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C. & Schulman, J. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021)
[53]
Nicolaou, C. A. & Brown, N. Multi-objective optimization methods in drug design. Drug Discovery Today: Technologies 10, e427–e435 (2013)
[54]
Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Molecular oncology 6, 155–176 (2012)
[55]
Kontoyianni, M. Docking and virtual screening in drug discovery. Proteomics for drug discovery: Methods and protocols, 255–266 (2017)
[56]
Agarwal, S. & Mehrotra, R. An overview of molecular docking. JSM chem 4, 1024–1028 (2016)
[57]
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry 31, 455–461 (2010)
[58]
Ursu, O., Rayan, A., Goldblum, A. & Oprea, T. I. Understanding drug-likeness. Wiley Interdisciplinary Reviews: Computational Molecular Science 1, 760–781 (2011)
[59]
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature chemistry 4, 90–98 (2012)
[60]
Bender, B. J., Gahbauer, S., Luttens, A., Lyu, J., Webb, C. M., Stein, R. M., Fink, E. A., Balius, T. E., Carlsson, J., Irwin, J. J., et al. A practical guide to large-scale docking. Nature protocols 16, 4799–4832 (2021)
[61]
García-Ortegón, M., Simm, G. N., Tripp, A. J., Hernández-Lobato, J. M., Bender, A. & Bacallado, S. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. Journal of chemical information and modeling 62, 3486–3502 (2022)
[62]
Khan, F. M., Sperduto, P. W. & Gibbons, J. P. Khan's Treatment Planning in Radiation Oncology (Lippincott Williams & Wilkins, 2021)
[63]
Webb, S. The physical basis of IMRT and inverse planning. The British journal of radiology 76, 678–689 (2003)
[64]
Hussein, M., Heijmen, B. J. M., Verellen, D. & Nisbet, A. Automation in Intensity Modulated Radiotherapy Treatment Planning: a Review of Recent Innovations. British Journal of Radiology 91, 20180270. ISSN: 0007-1285 (Dec. 2018)
[65]
Wieser, H.-P., Cisternas, E., Wahl, N., Ulrich, S., Stadler, A., Mescher, H., Müller, L.-R., Klinge, T., Gabrys, H., Burigo, L., et al. Development of the open-source dose calculation and optimization toolkit matRad. Medical Physics 44, 2556–2568 (2017)
[66]
Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452 (2023)
[67]
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Online, Nov. 2020), 4222–4235. https://aclanthology.org/2020...
[68]
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. Visual prompt tuning in European Conference on Computer Vision (2022), 709–727
[69]
Li, X. L. & Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Association for Computational Linguistics, Online, Aug. 2021), 4582–4597. https://ac...
[70]
[71]
Ye, Q., Axmed, M., Pryzant, R. & Khani, F. Prompt engineering a prompt engineer. arXiv preprint arXiv:2311.05661 (2023)
[72]
Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C. & Zaharia, M. Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022)
[73]
Singhvi, A., Shetty, M., Tan, S., Potts, C., Sen, K., Zaharia, M. & Khattab, O. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv preprint arXiv:2312.13382 (2023)
[74]
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. Large Language Models as Optimizers in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=Bb4VGOWELI
[75]
Song, X., Tian, Y., Lange, R. T., Lee, C., Tang, Y. & Chen, Y. Position: Leverage Foundational Models for Black-Box Optimization. 2024. arXiv: 2405.03547 [cs.LG]
[76]
Liu, T., Astorga, N., Seedat, N. & van der Schaar, M. Large Language Models to Enhance Bayesian Optimization in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=OOxotBmGol
[77]
Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N. & Goodman, N. Hypothesis Search: Inductive Reasoning with Language Models in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=G7UtIGQmjm
[78]
Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V., Lao, N., Lee, H., Juan, D.-C. & Guu, K. RARR: Researching and Revising What Language Models Say, Using Language Models in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguisti...
[79]
[80]
Shypula, A. G., Madaan, A., Zeng, Y., Alon, U., Gardner, J. R., Yang, Y., Hashemi, M., Neubig, G., Ranganathan, P., Bastani, O. & Yazdanbakhsh, A. Learning Performance-Improving Code Edits in The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=ix7rLVHXyY