pith. machine review for the scientific record.

arxiv: 2310.03714 · v1 · submitted 2023-10-05 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: 2 theorem links · Lean Theorem

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 18:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords DSPy · language model pipelines · declarative modules · prompt optimization · self-bootstrapping · compiler · few-shot prompting · performance improvement

The pith

DSPy turns a few lines of declarative code into language model pipelines that self-optimize and outperform few-shot and expert prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DSPy represents LM pipelines as graphs of declarative modules that invoke language models and can learn parameters by collecting their own demonstrations. A compiler then searches over possible module configurations to maximize a user-specified metric. This structure lets short programs build and improve sophisticated pipelines for math word problems, multi-hop retrieval, complex question answering, and agent loops. A sympathetic reader would care because the method replaces manual trial-and-error prompt writing with systematic, automatic optimization. Experiments show that compiled pipelines using GPT-3.5 or Llama2-13b-chat exceed standard few-shot baselines by large margins and often beat expert-written demonstrations.
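
The programming model is concrete enough to sketch. Below is a minimal multi-hop program in the spirit of the paper's examples; it assumes the public DSPy API (dspy.Module, dspy.Retrieve, dspy.ChainOfThought, string signatures), and exact names may differ across library versions.

```python
# A minimal sketch of a DSPy program, modeled on the paper's multi-hop example.
# Assumes the public DSPy API; class and argument names may vary by version.
import dspy

class MultiHopQA(dspy.Module):
    """Two-hop retrieve-then-read pipeline declared as a text transformation graph."""

    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        # String signatures declare each module's input -> output behavior.
        self.gen_query = dspy.ChainOfThought("context, question -> search_query")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages           # first hop
        query = self.gen_query(context=context, question=question).search_query
        context = context + self.retrieve(query).passages    # second hop
        return self.answer(context=context, question=question)
```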

Core claim

DSPy abstracts LM pipelines as text transformation graphs in which LMs are called through declarative, parameterized modules. The compiler optimizes any such pipeline for a given metric by automatically generating demonstrations and searching over module configurations and compositions of prompting, reasoning, and augmentation techniques. Succinct DSPy programs thereby produce pipelines that, after compilation, outperform standard few-shot prompting and expert-created demonstrations on tasks including math reasoning and multi-hop QA.

What carries the argument

Parameterized DSPy modules inside computational graphs, together with a compiler that collects demonstrations and searches configurations to maximize a target metric.
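
From the user's side, compilation is a single call. A hedged sketch, assuming the library's BootstrapFewShot teleprompter and a small labeled trainset; the metric and data below are placeholders, not details from the paper.

```python
# Hedged sketch of compiling the program above against a user-specified metric.
from dspy.teleprompt import BootstrapFewShot

def validate_answer(example, prediction, trace=None):
    # Placeholder metric: exact match against the gold answer.
    return example.answer.lower() == prediction.answer.lower()

teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_qa = teleprompter.compile(MultiHopQA(), trainset=trainset)  # trainset: user-supplied examples
```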

If this is right

  • Succinct DSPy programs can express and optimize complex pipelines for reasoning, retrieval, and control tasks.
  • Open models as small as 770M-parameter T5 become competitive with expert prompt chains written for proprietary GPT-3.5.
  • The same declarative program can be recompiled for different metrics or models without rewriting prompts (see the sketch after this list).
  • Models can self-bootstrap training data and improve their own performance on the target task within minutes.
  • Pipeline development shifts from hand-crafted strings to declarative code plus automatic optimization.
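
The recompilation point above is worth making concrete. A sketch under the same API assumptions: swap in an open model and a different metric without touching the program (the HFModel wrapper and F1 metric here are assumptions, not details from the paper).

```python
# Recompiling the identical program for a different LM and metric.
llama = dspy.HFModel(model="meta-llama/Llama-2-13b-chat-hf")  # assumed open-model client
dspy.settings.configure(lm=llama)

def f1_metric(example, prediction, trace=None):
    ...  # e.g., token-level F1 against example.answer

compiled_for_llama = BootstrapFewShot(metric=f1_metric).compile(
    MultiHopQA(), trainset=trainset
)
```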

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower the expertise barrier for building reliable LM applications by automating much of the prompt engineering.
  • Compiled pipelines might adapt more readily to new domains if the compiler is given additional unlabeled data or metrics.
  • Extending the same declarative graph structure to multimodal or tool-using agents would be a natural next step.
  • Combining the compiler with lightweight fine-tuning on the collected demonstrations could further improve small-model performance.

Load-bearing premise

Automatic search over module configurations driven by collected demonstrations will reliably locate high-performing pipelines without overfitting to the validation metric or demanding prohibitive compute.
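
To make the premise inspectable, here is a schematic paraphrase of a bootstrap-style optimizer, not DSPy's actual implementation: run the uncompiled program as its own teacher, keep traces the metric accepts, and pick whichever demonstration subset scores best on validation data. Every helper name here is hypothetical.

```python
import random

def bootstrap_compile(program, trainset, valset, metric, n_candidates=8, k=4):
    """Schematic bootstrap optimizer (paraphrase, not DSPy's code)."""
    # 1. Self-generate demonstrations: keep only metric-passing traces.
    demos = []
    for ex in trainset:
        pred = program(ex.question)
        if metric(ex, pred):
            demos.append((ex, pred))

    # 2. Search configurations: score random demo subsets on the valset.
    def score(candidate):
        configured = program.with_demos(candidate)  # hypothetical setter
        return sum(metric(ex, configured(ex.question)) for ex in valset) / len(valset)

    candidates = [random.sample(demos, min(k, len(demos))) for _ in range(n_candidates)]
    best = max(candidates, key=score)  # selection happens on the valset
    return program.with_demos(best)
```

Note that the final step scores and selects on the same valset, which is precisely the surface the referee's major comment below probes.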

What would settle it

On a new task, the DSPy compiler produces a pipeline whose accuracy is no higher than, or even lower than, that of a standard few-shot prompt baseline using the same underlying language model.
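
The falsifier is mechanical enough to write down as a self-contained check (program handles are hypothetical):

```python
def settles_it(compiled_program, few_shot_baseline, testset, metric):
    """True if compilation fails to beat plain few-shot prompting."""
    def accuracy(prog):
        return sum(metric(ex, prog(ex.question)) for ex in testset) / len(testset)
    return accuracy(compiled_program) <= accuracy(few_shot_baseline)
```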

read the original abstract

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at https://github.com/stanfordnlp/dspy

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DSPy, a programming model that represents LM pipelines as imperative computational graphs of declarative, parameterized modules. These modules learn by collecting demonstrations to compose prompting, reasoning, and other techniques. A compiler optimizes any DSPy program for a given metric via bootstrap search over module configurations and auto-generated demonstrations. Two case studies demonstrate that short DSPy programs enable GPT-3.5 and Llama-2-13B-chat to self-improve pipelines for math word problems, multi-hop QA, and agent control, outperforming standard few-shot prompting (by >25% and >65%) and expert demonstrations (by up to 5-46% and 16-40%). Compiled DSPy programs on smaller open models are competitive with expert GPT-3.5 chains.

Significance. If the reported gains are robust to validation-set selection bias, the work offers a valuable systematic alternative to manual prompt engineering by turning pipeline design into a programmable, optimizable artifact. The public GitHub release of the DSPy library supports reproducibility and further experimentation.

major comments (1)
  1. [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.
minor comments (2)
  1. [Abstract and §5] The abstract and experimental sections provide no details on the compiler's search algorithm (e.g., beam size, number of rounds), hyperparameter choices, or statistical significance testing of the reported deltas.
  2. [Figures/Tables in §5] Figure and table captions could more explicitly state the exact validation metric used for each task and whether the same split was used for both optimization and final reporting.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of DSPy's significance and for the detailed feedback on the bootstrap optimizer. We address the major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.

    Authors: We agree that the bootstrap optimizer, as currently described in §4, uses the validation set both to generate demonstrations and to select the best pipeline configuration, without a separate held-out selection set or post-selection evaluation on untouched data. This design is intentional for practical settings with limited labeled data, but we acknowledge the referee's point that it can introduce selection bias, particularly with weaker base models. The reported gains are measured on fully held-out test sets, yet the optimization step itself may overfit to the validation metric. We will revise the manuscript to (1) explicitly discuss this limitation in §4, (2) add experiments that reserve a portion of the validation data solely for post-selection evaluation, and (3) report results with Bonferroni-style corrections where multiple configurations are compared. These changes will provide stronger evidence that the observed improvements reflect genuine pipeline optimization rather than overfitting. revision: yes
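
The protocol the rebuttal commits to is easy to sketch on top of the bootstrap paraphrase in the pith above: split the validation data so that configuration selection and final reporting never see the same examples (all names remain hypothetical).

```python
# Post-selection evaluation on untouched data, per the rebuttal's proposal.
# Reuses the hypothetical bootstrap_compile sketch from earlier in this page.
split = len(valset) // 2
select_set, holdout_set = valset[:split], valset[split:]

compiled = bootstrap_compile(program, trainset, select_set, metric)
post_selection_score = sum(
    metric(ex, compiled(ex.question)) for ex in holdout_set
) / len(holdout_set)  # estimate unbiased by the configuration search
```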

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external test sets against fixed baselines

full rationale

The paper introduces DSPy as a declarative programming model and compiler for LM pipelines, with optimizers (including bootstrap) that collect demonstrations and search configurations to maximize a user-specified metric. The central claims consist of empirical results: compiled pipelines outperform standard few-shot prompting and expert demonstrations on held-out test sets for tasks like math word problems and multi-hop QA. These comparisons use fixed external baselines rather than quantities defined inside the DSPy system. No equations, uniqueness theorems, or first-principles derivations appear that reduce a reported prediction to a fitted parameter or self-citation by construction. The bootstrap process is described as an optimization procedure whose outputs are evaluated externally, so the reported performance does not rest on quantities defined inside the system itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that language-model behavior can be usefully abstracted as learnable declarative modules whose configurations can be searched by a compiler; no numerical constants are fitted in the reported results.

axioms (1)
  • domain assumption Language models respond usefully to compositions of prompting, finetuning, and reasoning techniques when those techniques are expressed through parameterized declarative modules.
    This is the foundational modeling choice stated in the abstract.
invented entities (2)
  • DSPy module no independent evidence
    purpose: Parameterized unit that invokes an LM and can learn from collected demonstrations
    New abstraction introduced by the paper; no independent evidence outside the framework itself.
  • DSPy compiler no independent evidence
    purpose: Optimizer that searches module configurations to maximize a metric
    New component introduced by the paper; no independent evidence outside the framework itself.

pith-pipeline@v0.9.0 · 5656 in / 1322 out tokens · 38086 ms · 2026-05-11T18:52:19.364355+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 conditional novelty 8.0

    LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  5. Efficient Ensemble Selection from Binary and Pairwise Feedback

    cs.GT 2026-05 unverdicted novelty 7.0

    The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...

  6. TRACE: Tourism Recommendation with Accountable Citation Evidence

    cs.IR 2026-05 unverdicted novelty 7.0

    TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...

  7. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

  8. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

  9. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

  10. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

    cs.AI 2026-05 conditional novelty 7.0

    Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...

  11. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  12. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    cs.CL 2026-04 unverdicted novelty 7.0

    AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.

  13. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  14. RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

    cs.LG 2026-04 unverdicted novelty 7.0

    RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...

  15. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  16. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  17. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    cs.CL 2024-06 accept novelty 7.0

    This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

  18. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  19. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

  20. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  21. Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    stat.ML 2026-05 unverdicted novelty 6.0

    SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.

  22. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  23. The Two Boundaries: Why Behavioral AI Governance Fails Structurally

    cs.AI 2026-04 conditional novelty 6.0 partial

    Behavioral governance of AI effects is undecidable for Turing-complete architectures, making coterminous boundaries via computation-effect separation the only structural solution rather than post-hoc layers.

  24. Probabilistic Programs of Thought

    cs.CL 2026-04 unverdicted novelty 6.0

    Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.

  25. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  26. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  27. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  28. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  29. Behavior Latticing: Inferring User Motivations from Unstructured Interactions

    cs.HC 2026-04 unverdicted novelty 6.0

    Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.

  30. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  31. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  32. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  33. A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...

  34. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  35. AgenticPosesRanker: An Agentic AI Framework for Physically Grounded Ranking of Protein-Ligand Docking Poses

    q-bio.BM 2026-05 conditional novelty 5.0

    AgenticPosesRanker ranks docking poses using six deterministic physical tools and LLM reasoning, achieving 50% best-pose accuracy that matches the Smina baseline on a balanced 10-system, 162-pose benchmark.

  36. LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

    cs.CL 2026-04 unverdicted novelty 5.0

    Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.

  37. Auditing and Controlling AI Agent Actions in Spreadsheets

    cs.HC 2026-04 unverdicted novelty 5.0

    Pista decomposes AI agent actions in spreadsheets into auditable steps, enabling real-time user intervention that improves task outcomes, user comprehension, agent perception, and sense of co-ownership over baseline agents.

  38. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

  39. When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM features optimized for high information coefficient with returns do not reliably improve PPO trading policies under distribution shifts, where price-only or macro baselines remain more robust.

  40. Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

    cs.CL 2026-04 unverdicted novelty 5.0

    AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.

  41. LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans

    cs.CL 2026-03 conditional novelty 5.0

    Zero-shot LLM agents with human personas predict individual social media reactions better than chance (MCC 0.29) but worse than conventional text classifiers (MCC 0.36).

  42. 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

    cs.DB 2026-03 unverdicted novelty 5.0

    Lightweight proxy models deliver over 100x cost and latency savings for semantic AI queries in databases with accuracy preserved or improved on benchmarks up to 10M rows.

  43. Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

    cs.IR 2026-03 unverdicted novelty 5.0

    Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.

  44. TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

    cs.CL 2026-05 unverdicted novelty 4.0

    TCMIIES is a zero-install browser platform with schema-guided LLM prompting that achieves over 94% structured output compliance for academic information extraction, including support for Chinese databases.

  45. Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

    cs.SE 2026-04 accept novelty 4.0

    Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.

  46. Supplement Generation Training for Enhancing Agentic Task Performance

    cs.LG 2026-04 unverdicted novelty 4.0

    SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

  47. Statistical Software Engineering with Tuned Variables

    cs.SE 2026-04 unverdicted novelty 4.0

    AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.

  48. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  49. Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

    cs.CL 2026-04 unverdicted novelty 3.0

    Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.

  50. Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

    cs.AI 2026-04 unverdicted novelty 3.0

    A deployed modular inference architecture for compound AI systems cut tail latency over 50%, boosted throughput up to 3.9x, and reduced costs 30-40% while handling multi-model agent workloads.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 46 Pith papers · 19 internal anchors

  1. [1]

    Optuna: A next-generation hyperparameter optimization framework

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623-2631, 2019

  2. [2]

    Theano: A Python framework for fast computation of mathematical expressions

    Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, pp. arXiv-1605, 2016

  3. [3]

    Theano: A CPU and GPU math compiler in Python

    James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th python in science conf, volume 1, pp. 3-10, 2010

  4. [4]

    Theano: Deep learning on gpus with Python

    James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with Python. In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011

  5. [5]

    Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

    James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp. 115-123. PMLR, 2013

  6. [6]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020

  8. [8]

    Hwchase17/langchain

    Harrison Chase. Hwchase17/langchain. 2022. URL https://github.com/hwchase17/langchain

  9. [9]

    Reading Wikipedia to answer open-domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870-1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URL https://acl...

  10. [10]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  11. [11]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  13. [13]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  14. [14]

    Torch: a modular machine learning software library

    Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002

  15. [15]

    Language model cascades

    David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

  16. [16]

    Rarr: Researching and revising what language models say, using language models

    Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477-16...

  17. [17]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764-10799. PMLR, 2023b

  18. [18]

    Connecting large language models with evolutionary algorithms yields powerful prompt optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023

  19. [19]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

  20. [20]

    Training classifiers with natural language explanations

    Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1884-1895. Association for Computational Linguistics, 2018. URL http://aclweb...

  21. [21]

    Enabling intelligent interactions between an agent and an llm: A reinforcement learning approach

    Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach. arXiv preprint arXiv:2306.03604, 2023. URL https://arxiv.org/abs/2306.03604

  22. [22]

    Large language models can self-improve

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

  23. [23]

    Atlas: Few-shot learning with retrieval augmented language models

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

  24. [24]

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022

  25. [25]

    Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval

    Omar Khattab, Christopher Potts, and Matei Zaharia. Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a

  26. [26]

    Relevance-guided supervision for openqa with ColBERT

    Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for openqa with ColBERT. Transactions of the Association for Computational Linguistics, 9:929-944, 2021b

  27. [27]

    Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP

    Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022

  28. [28]

    Decomposed prompting: A modular approach for solving complex tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

  29. [29]

    Large Language Models are Zero-Shot Reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

  30. [30]

    Internet-augmented language models through few-shot prompting for open-domain question answering

    Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

  31. [31]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural ...

  32. [32]

    LlamaIndex

    Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index

  33. [33]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

  34. [34]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730

  35. [35]

    Semantic kernel

    Microsoft. Semantic kernel. 2023. URL https://learn.microsoft.com/semantic-kernel/

  36. [36]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback, 2021. URL https://...

  37. [37]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  38. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  39. [39]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

  40. [40]

    Din-sql: Decomposed in-context learning of text-to-sql with self-correction

    Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023

  41. [41]

    arXiv preprint arXiv:2210.03350

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  42. [42]

    Automatic prompt optimization with "gradient descent" and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  43. [43]

    Answering complex open-domain questions through iterative query generation

    Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2590-2602, Hong Kong, ...

  44. [44]

    Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text

    Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D Manning, et al. Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527, 2020. URL https://arxiv.org/abs/2010.12527

  45. [45]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Ms, OpenAI, 2018. URL https://openai.com/blog/language-unsupervised/

  46. [46]

    Data programming: Creating large training sets, quickly

    Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3567-3575. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/65...

  47. [47]

    ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488, 2021

  48. [48]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  49. [49]

    Synthetic prompting: Generating chain-of-thought demonstrations for large language models

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618, 2023

  50. [50]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  51. [51]

    Prompting gpt-3 to be reliable

    Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

  52. [52]

    Recitation-augmented language models

    Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022

  53. [53]

    Chainer: a next-generation open source framework for deep learning

    Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1-6, 2015

  54. [54]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  55. [55]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

  56. [56]

    Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming

    Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. U...

  57. [57]

    Rationale-Augmented Ensembles in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022a

  58. [58]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b

  59. [59]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  60. [60]

    Transformers: State-of-the-Art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

  61. [61]

    Large Language Models as Optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023

  62. [62]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  63. [63]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  64. [64]

    Answering questions by meta-reasoning over multiple chains of thought

    Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023

  65. [65]

    STaR: Bootstrapping Reasoning with Reasoning

    Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022

  66. [66]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  67. [67]

    ExpeL: LLM agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. arXiv preprint arXiv:2308.10144, 2023a. URL https://arxiv.org/pdf/2308.10144

  68. [68]

    Automatic model selection with large language models for reasoning

    Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023b