OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3
The pith
LLMs self-improve iteratively through feedback but remain limited by base model capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OPT-BENCH evaluates the iterative self-optimization of LLM agents by placing them in combined machine-learning and NP-hard problem environments where they must repeatedly revise solutions in response to environmental feedback. OPT-Agent implements this process as a closed perception-memory-reasoning loop that updates internal state and generates the next candidate solution. Experiments across nineteen models show that stronger base models convert feedback into larger performance gains, yet the ceiling of this improvement remains bounded by the model's base capacity, and even the strongest models fall short of human-expert performance.
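The loop itself is easiest to see in pseudocode. The sketch below is a minimal reading of the perception-memory-reasoning cycle described above; the `llm` and `evaluate` callables, the prompt format, and the scoring convention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an OPT-Agent-style loop, assuming a text-in/text-out LLM
# and a task environment that scores candidate solutions. Not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Stores (solution, feedback) pairs from earlier iterations."""
    history: list = field(default_factory=list)

    def render(self) -> str:
        return "\n\n".join(f"Attempt:\n{s}\nFeedback:\n{f}" for s, f in self.history)

def opt_agent_loop(task: str, llm, evaluate, max_iters: int = 10):
    memory = Memory()
    best_solution, best_score = None, float("-inf")
    for _ in range(max_iters):
        # Reasoning: propose the next candidate conditioned on the task and memory.
        prompt = (f"Task:\n{task}\n\nPrior attempts and feedback:\n"
                  f"{memory.render() or '(none yet)'}\n\nPropose an improved solution.")
        solution = llm(prompt)
        # Perception: read environmental feedback (validation score, solver check, ...).
        score, feedback = evaluate(solution)
        # Memory: persist the attempt so later iterations can build on it.
        memory.history.append((solution, feedback))
        if score > best_score:
            best_solution, best_score = solution, score
    return best_solution, best_score
```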
What carries the argument
The OPT-BENCH benchmark together with the OPT-Agent perception-memory-reasoning loop, which repeatedly reads environmental feedback and produces the next candidate solution in large discrete search spaces.
If this is right
- Performance differences between model families will persist even when all models use identical adaptation loops.
- Increasing the number of iterations will produce diminishing returns once a model's base capacity is reached (see the saturation sketch after this list).
- Human-expert performance on these tasks will remain out of reach for any current LLM regardless of iteration count.
- Self-optimization frameworks cannot substitute for improvements in the underlying model's training or scale.
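To make the diminishing-returns prediction concrete, here is a toy saturation model; the exponential form and every number in it are assumptions chosen for illustration, not quantities from the paper.

```python
# Toy saturation model for the second prediction above: per-iteration gains
# shrink as performance approaches a capacity ceiling it never exceeds.
def predicted_score(p0: float, ceiling: float, rate: float, t: int) -> float:
    """Score after t feedback iterations, converging toward `ceiling`."""
    return ceiling - (ceiling - p0) * (1 - rate) ** t

weak   = [predicted_score(0.30, 0.55, 0.25, t) for t in range(11)]
strong = [predicted_score(0.45, 0.80, 0.40, t) for t in range(11)]
# Both curves flatten; the stronger model starts higher and converts feedback
# faster (larger rate), yet each stays below its own capacity ceiling.
print([round(s, 3) for s in weak])
print([round(s, 3) for s in strong])
```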
Where Pith is reading between the lines
- Future agent research may yield larger returns by first raising base model capacity before designing more elaborate feedback loops.
- The benchmark could be reused to test whether hybrid systems that combine an LLM with an external optimizer can exceed the pure LLM ceiling.
- Similar feedback loops might be applied to domains outside optimization, such as code refactoring or scientific hypothesis refinement, to check whether the capacity limit is domain-specific.
Load-bearing premise
The selected machine-learning tasks and NP-hard problems, together with the OPT-Agent loop, measure intrinsic self-reflection and adaptation rather than rewarding memorized patterns or tool-use skills already present in the base models.
What would settle it
If weaker models show relative gains equal to or larger than those of stronger models after the same number of feedback iterations on the benchmark tasks, the claim that stronger models leverage feedback more effectively would be refuted.
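A hedged sketch of how that test could be scored follows; the `results` structure, the scores, and the gain definition are placeholders, not the paper's protocol.

```python
# Compare mean relative gains of a weaker vs. a stronger model after the same
# number of feedback iterations. All data below is placeholder.
from statistics import mean

def relative_gain(initial: float, final: float) -> float:
    return (final - initial) / max(initial, 1e-9)

# results[model] = list of (initial_score, final_score) pairs, one per task
results = {
    "weak-3b":    [(0.20, 0.26), (0.15, 0.19)],
    "strong-70b": [(0.45, 0.62), (0.50, 0.71)],
}
gains = {m: mean(relative_gain(i, f) for i, f in runs) for m, runs in results.items()}
# If the weak model's mean relative gain matches or exceeds the strong model's
# across the benchmark, the central claim would be refuted.
print(gains)
```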
Original abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models' base capacity, and even the most advanced LLMs still fall short of human expert performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OPT-BENCH, a benchmark that combines 20 machine learning tasks with 10 classic NP-hard problems to evaluate iterative self-optimization in LLM agents operating over large search spaces. It proposes the OPT-Agent framework, which implements a perception-memory-reasoning loop that iteratively refines solutions using environmental feedback. Experiments across 19 LLMs from 7 families (3B to 235B parameters) are used to argue that stronger models more effectively exploit feedback for self-improvement, yet remain fundamentally limited by base model capacity and still lag human experts.
Significance. If the experimental controls isolate iterative adaptation from base-model reasoning, the work would supply a useful new benchmark and framework for studying LLM agent self-optimization beyond static prompting. The scale of the evaluation (19 models, 30 tasks) and the explicit comparison to human performance are strengths that could inform future agent design. The paper does not ship machine-checked proofs or parameter-free derivations, but the empirical scope is a positive contribution to the LLM-agent evaluation literature.
major comments (1)
- [Abstract and Experimental Results] The central claim that 'stronger models are more effective at leveraging feedback signals for self-improvement' and that adaptability 'remains fundamentally constrained by the models' base capacity' is load-bearing for the entire contribution. The manuscript reports comparative results across model families but provides no details on statistical controls, exact feedback mechanisms, or how rote tool use was ruled out (Abstract). In particular, no ablations are described that compare the full OPT-Agent iterative loop against single-pass or fixed-turn prompting with matched token budgets on the same 30 tasks. Without these controls, observed performance differences could simply reflect superior zero-shot reasoning or tool heuristics already present in stronger base models rather than differential ability to use the perception-memory-reasoning loop.
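For concreteness, the requested ablation could be organized roughly as below; `run_condition`, the mode names, and the budget are hypothetical stand-ins for the real agent runners, not anything from the paper's codebase.

```python
# Sketch of the requested ablation: full OPT-Agent loop vs. single-pass vs.
# fixed-turn prompting, all under one matched token budget on the same tasks.
TOKEN_BUDGET = 50_000  # identical compute ceiling for every condition

def run_condition(task: str, mode: str, token_budget: int) -> float:
    """Placeholder: dispatch to the actual agent runner and return a score."""
    raise NotImplementedError("hook up the real agent implementations here")

def run_ablation(tasks: list[str]) -> dict[str, list[float]]:
    # Score each condition on the identical 30 tasks so any gap is
    # attributable to the iterative loop rather than extra compute.
    modes = ["opt_agent_iterative", "single_pass", "fixed_turn_3"]
    return {m: [run_condition(t, m, TOKEN_BUDGET) for t in tasks] for m in modes}
```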
minor comments (2)
- [Abstract] The abstract states that OPT-BENCH 'provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application,' yet does not quantify the size of the search spaces or the number of iterations permitted; adding these numbers would improve clarity.
- [OPT-Agent Framework] Notation for the OPT-Agent loop (perception, memory, reasoning) is introduced without an accompanying diagram or pseudocode in the main text; a compact figure would aid readers.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed feedback. We agree that stronger controls are needed to isolate the effects of the iterative loop and will revise the manuscript accordingly to better substantiate our central claims.
Point-by-point responses
Referee: [Abstract and Experimental Results] The central claim that 'stronger models are more effective at leveraging feedback signals for self-improvement' and that adaptability 'remains fundamentally constrained by the models' base capacity' is load-bearing for the entire contribution. The manuscript reports comparative results across model families but provides no details on statistical controls, exact feedback mechanisms, or how rote tool use was ruled out (Abstract). In particular, no ablations are described that compare the full OPT-Agent iterative loop against single-pass or fixed-turn prompting with matched token budgets on the same 30 tasks. Without these controls, observed performance differences could simply reflect superior zero-shot reasoning or tool heuristics already present in stronger base models rather than differential ability to use the perception-memory-reasoning loop.
Authors: We acknowledge the importance of these controls for validating the load-bearing claims. In the revised version we will expand the Methods section with: (1) precise specifications of the feedback signals and perception-memory-reasoning loop implementation; (2) statistical controls including multiple independent runs per model-task pair, standard-error reporting, and paired significance tests; and (3) explicit discussion of why the benchmark tasks require iterative refinement beyond initial tool calls, thereby distinguishing the framework from rote tool use. Most critically, we will add the requested ablations: full OPT-Agent versus single-pass prompting and versus fixed-turn prompting, all with matched token budgets, evaluated on the identical 30 tasks. These new results will be presented in a dedicated subsection and will directly test whether performance gains arise from the iterative loop rather than base-model differences alone. We believe the added experiments will materially strengthen the paper.
Revision: yes
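As one concrete reading of the promised statistical controls, a paired test over the shared task set might look like the following; the scores are placeholders, and the choice of a paired t-test via `scipy` is one reasonable option among several.

```python
# Sketch of the statistical controls described in the rebuttal: standard
# errors over per-task means and a paired significance test across conditions.
import statistics
from scipy import stats  # paired significance test

# Mean score per task (averaged over independent runs), aligned by task index.
iterative   = [0.62, 0.55, 0.71, 0.48, 0.66]
single_pass = [0.51, 0.49, 0.60, 0.47, 0.58]

se = statistics.stdev(iterative) / len(iterative) ** 0.5  # standard error (iterative)
t_stat, p_value = stats.ttest_rel(iterative, single_pass)  # paired across tasks
print(f"SE={se:.3f}, t={t_stat:.2f}, p={p_value:.4f}")
```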
Circularity Check
Empirical benchmark study with no derivation chain or self-referential reductions
Full rationale
The paper introduces OPT-BENCH and the OPT-Agent framework purely as an empirical evaluation tool, combining ML tasks and NP-hard problems to test LLM adaptation via experiments across 19 models. No equations, parameters, or derivations are presented that could reduce performance claims to fitted inputs or self-definitions by construction. Claims about stronger models leveraging feedback better (yet constrained by base capacity) rest on reported experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled from prior work. This matches the default case of a self-contained empirical benchmark with independent content from its evaluations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs possess stable core faculties of perception, reasoning, and memory that can be applied to novel environments via iterative feedback.
- domain assumption: The 20 ML tasks plus 10 NP-hard problems constitute a rigorous setting for distinguishing self-reflection from rote tool application.
invented entities (2)
- OPT-BENCH benchmark · no independent evidence
- OPT-Agent framework · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance: unclear · linked claim: "OPT-Agent operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback"