pith. machine review for the scientific record.

arxiv: 2605.08904 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

Haodong Duan, Jixuan Chen, Kai Chen, Qingwen Liu, Shengyuan Ding, Xiaozhe Li, Xinyu Fang

Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · self-improvement · benchmark · iterative optimization · feedback loops · NP-hard problems · machine learning tasks · model capacity

The pith

LLMs self-improve iteratively through feedback but remain limited by base model capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OPT-BENCH to measure whether language models can keep refining solutions in large, changing search spaces by using environmental feedback to guide their own adjustments. It combines twenty machine learning tasks with ten classic NP-hard problems to create settings where rote recall is unlikely to suffice and genuine adaptation is required. The authors also introduce OPT-Agent, a simple loop that cycles through perception of the current state, memory of past attempts, and reasoning to propose the next change. Tests on nineteen models from seven families show that larger and more capable base models extract more benefit from each round of feedback, yet every model tested stops well short of human expert results. This finding matters because it separates the question of clever agent design from the deeper question of whether the underlying model already contains enough raw capacity to improve itself without external scaffolding.

Core claim

OPT-BENCH evaluates iterative self-optimization of LLM agents by placing them in combined machine-learning and NP-hard problem environments where they must repeatedly adjust solutions after receiving environmental feedback. OPT-Agent implements this process through a closed perception-memory-reasoning loop that updates internal state and generates the next candidate solution. Experiments across nineteen models demonstrate that stronger base models convert feedback into larger performance gains, yet the ceiling of this improvement remains bounded by the model's base capacity and falls short of human-expert levels.

What carries the argument

The OPT-BENCH benchmark together with the OPT-Agent perception-memory-reasoning loop, which repeatedly reads environmental feedback and produces the next candidate solution in large discrete search spaces. A sketch of that loop follows.
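As a concrete reading of the loop, here is a minimal sketch in Python. The `llm` and `evaluate` callables are hypothetical stand-ins for the model interface and the task environment, and the draft/improve/debug action split follows the node labels in Figures 4, 6, and 7; everything else about the wiring is an assumption, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Attempt:
    solution: str
    feedback: str        # environmental signal: error log, metric, validator output
    score: float | None  # None when the solution failed to run

@dataclass
class OptAgentLoop:
    """Hedged sketch of a perception-memory-reasoning cycle, not the authors' code."""
    llm: callable                      # prompt -> text; hypothetical interface
    evaluate: callable                 # solution -> (feedback, score); task environment
    memory: list[Attempt] = field(default_factory=list)

    def step(self, task: str) -> Attempt:
        # Perception + memory: summarize past attempts and their feedback.
        history = "\n".join(
            f"[{i}] score={a.score} feedback={a.feedback}"
            for i, a in enumerate(self.memory)
        )
        # Reasoning: pick draft, debug, or improve (the three actions named in
        # Figures 4, 6, and 7) depending on the last outcome.
        action = "draft" if not self.memory else (
            "debug" if self.memory[-1].score is None else "improve"
        )
        solution = self.llm(f"Task: {task}\nAction: {action}\nHistory:\n{history}")
        feedback, score = self.evaluate(solution)
        attempt = Attempt(solution, feedback, score)
        self.memory.append(attempt)
        return attempt

    def run(self, task: str, iterations: int) -> Attempt:
        for _ in range(iterations):
            self.step(task)
        # Return the best-scoring attempt seen so far.
        return max(self.memory, key=lambda a: (a.score is not None, a.score or 0.0))
```

The point of the sketch is that the loop itself is generic; under the paper's claim, all of the differentiation between models comes from how well `llm` converts the accumulated history into a better next candidate.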

If this is right

  • Performance differences between model families will persist even when all models use identical adaptation loops.
  • Increasing the number of iterations will produce diminishing returns once a model's base capacity is reached.
  • Human-expert performance on these tasks will remain out of reach for any current LLM regardless of iteration count.
  • Self-optimization frameworks cannot substitute for improvements in the underlying model's training or scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future agent research may yield larger returns by first raising base model capacity before designing more elaborate feedback loops.
  • The benchmark could be reused to test whether hybrid systems that combine an LLM with an external optimizer can exceed the pure LLM ceiling.
  • Similar feedback loops might be applied to domains outside optimization, such as code refactoring or scientific hypothesis refinement, to check whether the capacity limit is domain-specific.

Load-bearing premise

The selected machine-learning tasks and NP-hard problems, together with the OPT-Agent loop, measure intrinsic self-reflection and adaptation rather than rewarding memorized patterns or tool-use skills already present in the base models.

What would settle it

If weaker models show equal or larger relative gains than stronger models after the same number of feedback iterations on the benchmark tasks, the claim that stronger models leverage feedback more effectively would be refuted.
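One way to operationalize that check, sketched below under stated assumptions: per-iteration scores for each model are available, and relative gain stands in for the paper's Improvement Rate (whose exact definition is truncated in the Figure 3 caption). The trajectories are hypothetical.

```python
def relative_gain(scores: list[float]) -> float:
    """Relative improvement from first to best attempt; a hedged proxy
    for the paper's Improvement Rate, not its exact definition."""
    first, best = scores[0], max(scores)
    return (best - first) / abs(first) if first != 0 else float("inf")

# Hypothetical trajectories: the falsifier holds if the weaker model's
# relative gain matches or exceeds the stronger model's.
weak_model   = [0.40, 0.46, 0.52, 0.53]   # large relative gain from a low base
strong_model = [0.70, 0.74, 0.75, 0.75]   # small relative gain from a high base

if relative_gain(weak_model) >= relative_gain(strong_model):
    print("claim refuted under this metric")
else:
    print("claim survives under this metric")
```

As the toy numbers show, relative gains mechanically favor low baselines, so a refutation on this metric alone would be weak evidence; absolute gains or comparisons at matched starting scores would make the test fairer.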

Figures

Figures reproduced from arXiv: 2605.08904 by Haodong Duan, Jixuan Chen, Kai Chen, Qingwen Liu, Shengyuan Ding, Xiaozhe Li, Xinyu Fang.

Figure 1
Figure 1. OPT-BENCH framework. The framework evaluates iterative self-optimization by integrating two distinct reasoning modalities: continuous parametric optimization for machine learning tasks (top) and discrete combinatorial reasoning for NP-hard problems (bottom). In both settings, the agent leverages environmental feedback to guide its trajectory of improvement and debugging.
Figure 2
Figure 2. Overview of the OPT-BENCH dataset and OPT-Agent framework. The left panel illustrates the data structure of OPT-BENCH, encompassing ML and NP problems. Each module includes problem definitions, dataset files, a validation script (NP), evaluation metrics, and submission formats, integrating human-verified initial solutions and LLM-assisted refinement. The right panel details the evaluation workflow.
Figure 3
Figure 3. Specific cases from OPT-BENCH, taking the Spaceship Titanic classification task and the Hamiltonian cycle optimization problem as representative examples. A high Win Count indicates that the model is successfully interpreting environmental feedback and using it to guide performance improvement; Improvement Rate (IR) is a quantitative measure …
Figure 4
Figure 4. Prompt template of OPT-Agent. Orange denotes the draft action, green the improve action, purple the debug action, and blue the shared prompts.
Figure 5
Figure 5. Optimization trajectories, contrasting the agent's path against the environment. In ML (top), the agent uses error logs to monotonically improve, demonstrating true self-optimization. In NP (bottom), feedback often triggers erratic jumps, indicating a struggle to map discrete environmental signals to valid solution updates.
Figure 6
Figure 6. Detailed OPT-Agent-ML trace on the Bike Sharing Demand task, using gemini-2.0-flash as the base model. The red and blue nodes represent the improve and debug actions, respectively.
Figure 7
Figure 7. Detailed OPT-Agent-NP trace on the Hamiltonian Cycle task, using gemini-2.0-flash as the base model. The yellow, red, and blue nodes represent the draft, improve, and debug actions, respectively.
Figure 8
Figure 8. Fixed prompts in OPT-Agent, encompassing the response format, implementation guidelines, and the solution draft, improvement, and debug sketch guidelines for ML tasks, as well as example inputs and outputs, instructions, and response format for NP problems.
read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models' base capacity, and even the most advanced LLMs still fall short of human expert performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces OPT-BENCH, a benchmark that combines 20 machine learning tasks with 10 classic NP-hard problems to evaluate iterative self-optimization in LLM agents operating over large search spaces. It proposes the OPT-Agent framework, which implements a perception-memory-reasoning loop that iteratively refines solutions using environmental feedback. Experiments across 19 LLMs from 7 families (3B to 235B parameters) are used to argue that stronger models more effectively exploit feedback for self-improvement, yet remain fundamentally limited by base model capacity and still lag human experts.

Significance. If the experimental controls isolate iterative adaptation from base-model reasoning, the work would supply a useful new benchmark and framework for studying LLM agent self-optimization beyond static prompting. The scale of the evaluation (19 models, 30 tasks) and the explicit comparison to human performance are strengths that could inform future agent design. The paper does not ship machine-checked proofs or parameter-free derivations, but the empirical scope is a positive contribution to the LLM-agent evaluation literature.

major comments (1)
  1. [Abstract and Experimental Results] The central claim that 'stronger models are more effective at leveraging feedback signals for self-improvement' and that adaptability 'remains fundamentally constrained by the models' base capacity' is load-bearing for the entire contribution. The manuscript reports comparative results across model families but provides no details on statistical controls, exact feedback mechanisms, or how rote tool use was ruled out (Abstract). In particular, no ablations are described that compare the full OPT-Agent iterative loop against single-pass or fixed-turn prompting with matched token budgets on the same 30 tasks. Without these controls, observed performance differences could simply reflect superior zero-shot reasoning or tool heuristics already present in stronger base models rather than differential ability to use the perception-memory-reasoning loop.
minor comments (2)
  1. [Abstract] The abstract states that OPT-BENCH 'provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application,' yet does not quantify the size of the search spaces or the number of iterations permitted; adding these numbers would improve clarity.
  2. [OPT-Agent Framework] Notation for the OPT-Agent loop (perception, memory, reasoning) is introduced without an accompanying diagram or pseudocode in the main text; a compact figure would aid readers.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed feedback. We agree that stronger controls are needed to isolate the effects of the iterative loop and will revise the manuscript accordingly to better substantiate our central claims.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central claim that 'stronger models are more effective at leveraging feedback signals for self-improvement' and that adaptability 'remains fundamentally constrained by the models' base capacity' is load-bearing for the entire contribution. The manuscript reports comparative results across model families but provides no details on statistical controls, exact feedback mechanisms, or how rote tool use was ruled out (Abstract). In particular, no ablations are described that compare the full OPT-Agent iterative loop against single-pass or fixed-turn prompting with matched token budgets on the same 30 tasks. Without these controls, observed performance differences could simply reflect superior zero-shot reasoning or tool heuristics already present in stronger base models rather than differential ability to use the perception-memory-reasoning loop.

    Authors: We acknowledge the importance of these controls for validating the load-bearing claims. In the revised version we will expand the Methods section with: (1) precise specifications of the feedback signals and perception-memory-reasoning loop implementation; (2) statistical controls including multiple independent runs per model-task pair, standard-error reporting, and paired significance tests; and (3) explicit discussion of why the benchmark tasks require iterative refinement beyond initial tool calls, thereby distinguishing the framework from rote tool use. Most critically, we will add the requested ablations: full OPT-Agent versus single-pass prompting and versus fixed-turn prompting, all with matched token budgets, evaluated on the identical 30 tasks. These new results will be presented in a dedicated subsection and will directly test whether performance gains arise from the iterative loop rather than base-model differences alone. We believe the added experiments will materially strengthen the paper.

    revision: yes
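A minimal sketch of the paired comparison the rebuttal promises, assuming per-task scores are collected for the full loop and for a matched-budget single-pass baseline on the same tasks; the scores below are hypothetical, and pairing by task is one reasonable design choice, not necessarily the authors':

```python
from scipy import stats

# Per-task scores on the same tasks; pairing by task controls for task
# difficulty, so the test isolates the contribution of the iterative loop.
opt_agent   = [0.81, 0.64, 0.72, 0.55, 0.90, 0.47]  # full iterative loop
single_pass = [0.78, 0.60, 0.71, 0.49, 0.88, 0.47]  # matched token budget

t_stat, p_value = stats.ttest_rel(opt_agent, single_pass)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value with a positive mean difference would support the claim
# that gains come from iteration rather than base-model strength alone.
```

With only 30 tasks and non-normal score distributions, a Wilcoxon signed-rank test (`stats.wilcoxon`) would be a natural robustness check alongside the t-test.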

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or self-referential reductions

full rationale

The paper introduces OPT-BENCH and the OPT-Agent framework purely as an empirical evaluation tool, combining ML tasks and NP-hard problems to test LLM adaptation via experiments across 19 models. No equations, parameters, or derivations are presented that could reduce performance claims to fitted inputs or self-definitions by construction. Claims about stronger models leveraging feedback better (yet constrained by base capacity) rest on reported experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled from prior work. This matches the default case of a self-contained empirical benchmark whose content is independent of the evaluations it reports.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that the selected tasks require genuine iterative adaptation rather than memorized patterns, and that the perception-memory-reasoning loop faithfully emulates human-like self-improvement. No free parameters are introduced, and the two invented entities are methodological rather than physical: the benchmark construction and the agent architecture itself.

axioms (2)
  • domain assumption LLMs possess stable core faculties of perception, reasoning, and memory that can be applied to novel environments via iterative feedback.
    Stated in the opening paragraph as the stable core of intelligence that the benchmark aims to test.
  • domain assumption The 20 ML tasks plus 10 NP-hard problems constitute a rigorous setting for distinguishing self-reflection from rote tool application.
    Used to justify the benchmark design as the basis for all reported comparisons.
invented entities (2)
  • OPT-BENCH benchmark no independent evidence
    purpose: Provide a standardized large-scale search space for measuring self-improvement.
    Newly constructed testbed; independent evidence would be public release and adoption by other groups.
  • OPT-Agent framework no independent evidence
    purpose: Emulate human-like cognitive adaptation through a perception-memory-reasoning loop.
    Newly proposed agent architecture; independent evidence would be reproducible code and results on the benchmark.

pith-pipeline@v0.9.0 · 5579 in / 1617 out tokens · 68063 ms · 2026-05-12T02:48:05.512979+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

150 extracted references · 150 canonical work pages · 36 internal anchors
