pith. machine review for the scientific record.

arxiv: 2605.08756 · v1 · submitted 2026-05-09 · 💻 cs.AI · cs.NE

Recognition: 2 theorem links · Lean Theorem

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Haoze Lv, Ning Lu, Shengcai Liu, Ziang Zhou

Pith reviewed 2026-05-12 01:10 UTC · model grok-4.3

classification 💻 cs.AI cs.NE
keywords automatic heuristic design · agentic reinforcement learning · LLM agents · combinatorial optimization · tool-integrated agents · heuristic discovery · multi-turn decision making

The pith

A 4B-parameter agent trained with agentic reinforcement learning matches or surpasses larger models in automatic heuristic design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AHD Agent as a framework that lets language models act as agents deciding dynamically whether to output a heuristic or call tools to pull specific evidence from the current solving state. Training happens through reinforcement learning on environments built by a new synthesis pipeline that supplies the signals needed for good decisions. The result is a compact 4B model that equals or beats existing methods built around much larger models while using far fewer evaluations. This holds across eight different combinatorial optimization domains, four of which the model never encountered during training. A reader would care because the work points to a practical route for making heuristic discovery less dependent on massive models and more autonomous.

Core claim

Framing automatic heuristic design as a multi-turn agentic process, in which the model chooses at each step to generate a heuristic or invoke tools for targeted evidence, and then training the resulting policy with reinforcement learning on environments produced by a novel synthesis pipeline, yields a 4B-parameter model that matches or exceeds state-of-the-art baselines built on substantially larger models while requiring significantly fewer evaluations on both training and held-out tasks.

What carries the argument

The AHD Agent: a tool-integrated multi-turn framework in which the language model proactively chooses between heuristic generation and tool invocation to retrieve state-dependent evidence.
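
The turn structure is the crux, so a sketch helps. Below is a minimal Python rendering of that loop as the paper describes it at a high level; `llm_decide`, `call_tool`, and `evaluate` are hypothetical stand-ins for the policy, the tool set, and the heuristic evaluator, and the action schema and stopping rule are assumptions, not the paper's interface.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                       # "tool_call", "generate", or "submit"
    tool_name: str = ""
    args: dict = field(default_factory=dict)
    heuristic: str = ""             # candidate heuristic source code

def ahd_agent_loop(problem, seed_heuristic, tools,
                   llm_decide, call_tool, evaluate, max_turns=30):
    """Hypothetical multi-turn loop: at each turn the policy either pulls
    state-dependent evidence via a tool or spends an evaluator call on a
    new heuristic. Assumes larger evaluation scores are better."""
    history = [("problem", problem), ("seed", seed_heuristic)]
    best_heuristic, best_score = seed_heuristic, evaluate(seed_heuristic)

    for _ in range(max_turns):
        action = llm_decide(history, tools)  # policy sees full session history
        if action.kind == "tool_call":
            evidence = call_tool(action.tool_name, action.args)
            history.append(("tool_result", action.tool_name, evidence))
        elif action.kind == "generate":
            score = evaluate(action.heuristic)   # one evaluator call
            history.append(("evaluation", action.heuristic, score))
            if score > best_score:
                best_heuristic, best_score = action.heuristic, score
        else:                                    # "submit": agent stops early
            break
    return best_heuristic, best_score
```

The sketch makes the efficiency argument visible: tool calls and evaluations draw on the same turn budget, so a trained policy only pays for evidence when it expects the evidence to improve the next candidate.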

If this is right

  • Compact models become sufficient for competitive automatic heuristic design.
  • Proactive tool use reduces the number of trials needed compared with passive generation.
  • Performance extends to tasks absent from the reinforcement learning phase.
  • The approach supplies a concrete trajectory toward fully autonomous heuristic discovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same synthesis-plus-RL pattern could be tested for training agents in other domains that require adaptive tool selection.
  • Lower evaluation counts may allow the method to scale to larger problem instances than current baselines can handle.
  • If the decision policy proves robust, future versions could reduce reliance on hand-designed prompts in heuristic search systems.

Load-bearing premise

The environment synthesis pipeline generates training signals that produce decision-making skills transferable to optimization tasks the agent has not seen during reinforcement learning.
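
The review only sees the pipeline's outputs, so as a reading aid, here is a hypothetical shape a synthesized training environment could take; every field, parameter range, and helper name below is an illustrative assumption, not the schema from the paper's Section 3.2.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class SynthesizedEnv:
    domain: str                       # e.g., "TSP" or "CVRP"
    instance_params: dict             # varied problem parameters
    seed_heuristic: str               # starting heuristic source code
    evaluate: Callable[[str], float]  # heuristic code -> reward signal

def synthesize_envs(domains, heuristic_pool, make_evaluator, n_envs, seed=0):
    """Vary domain, instance parameters, and seed heuristic so the agent
    must adapt its tool-use decisions instead of memorizing one task."""
    rng = random.Random(seed)
    envs = []
    for _ in range(n_envs):
        domain = rng.choice(domains)
        params = {"size": rng.choice([50, 100, 200]),
                  "clustered": rng.random() < 0.5}
        envs.append(SynthesizedEnv(
            domain=domain,
            instance_params=params,
            seed_heuristic=rng.choice(heuristic_pool[domain]),
            evaluate=make_evaluator(domain, params),
        ))
    return envs
```

If the premise holds, diversity along axes like these is what teaches the decision policy, rather than any single domain's structure.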

What would settle it

Showing that the 4B agent needs more evaluations or performs worse than larger baselines on a fresh collection of held-out combinatorial optimization tasks would falsify the claim of generalizable performance.
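
Operationally, that is a matched-budget comparison: run every method on the same fresh held-out tasks under an identical evaluator-call cap and compare best-so-far objectives. A minimal sketch, where `run_method` is a hypothetical adapter that yields `(evaluator_calls_used, objective)` pairs for any AHD system:

```python
def compare_at_budget(methods, tasks, budget=100):
    """methods: name -> adapter; each adapter(task) yields
    (evaluator_calls_used, objective) pairs. Assumes minimization."""
    results = {}
    for name, run_method in methods.items():
        per_task_best = []
        for task in tasks:
            best = float("inf")
            for calls_used, objective in run_method(task):
                if calls_used > budget:
                    break              # identical budget for every method
                best = min(best, objective)
            per_task_best.append(best)
        results[name] = sum(per_task_best) / len(per_task_best)
    return results
```

If the 4B agent's row loses to the larger-model baselines at equal budget on genuinely fresh tasks, the generalization claim fails.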

Figures

Figures reproduced from arXiv: 2605.08756 by Haoze Lv, Ning Lu, Shengcai Liu, Ziang Zhou.

Figure 1. Traditional LLM-based AHD vs. AHD Agent. Traditional AHD places the LLM inside a fixed loop; AHD Agent enables the LLM to design heuristics by actively calling tools, generating candidates, and performing evaluations.

Figure 2. Simply providing all available information (tools) to LLMs within fixed workflows brings limited gains and may even hurt performance, suggesting that the key challenge is not information availability alone, but the lack of state-dependent mechanisms for acquiring and using relevant information.

Figure 3. Demonstration of the AHD Agent workflow. Given a problem description, a seed heuristic, and a set of tools, the model iteratively decides its next action based on the session history, which records all previous interactions. At each turn, it can call tools, generate and evaluate heuristics, and finally return the best heuristic.

Figure 4. Scaling effect of AHD Agent. (a) Inference-time scaling comparison: the SR strategy outperforms the PS strategy on two tasks. (b) Model scaling favors AHD Agent: performance increases as model size grows from 30B to 397B parameters. (Panels plot best-so-far objective against evaluator calls on CVRP-ACO and OP-ACO for EoH, ReEvo, MCTS-AHD, and AHD Agent.)

Figure 5. Training curves during design. AHD Agent converges faster and achieves better performance under larger evaluation budgets. (Panels plot reward and number of turns against RL update step.)

Figure 7. Performance changes as the number of training domains increases. Cross-domain RL training increases the general AHD capability.

Figure 8. Objective-value distribution of the source heuristic pools used to generate the RL training …

Figure 9. RL training diagnostics over 500 steps. Top-left: quality reward (higher is better). …

Figure 10. Per-domain train-side validation curves over RL training steps. Each panel reports the …

Figure 11. Mean validation Gap (%) across the three reported problem sizes as the GPT-series …
Original abstract

Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AHD Agent, a tool-integrated multi-turn framework for automatic heuristic design (AHD) in NP-hard combinatorial optimization problems. The LLM is trained via agentic reinforcement learning on environments produced by a novel synthesis pipeline, enabling it to decide proactively between generating heuristics and invoking tools to retrieve targeted evidence from the solving environment. The central claim is that the resulting 4B-parameter agent matches or surpasses state-of-the-art baselines (built on much larger models) across eight diverse domains, including four held-out tasks, while requiring significantly fewer evaluations.

Significance. If the empirical results hold under rigorous scrutiny, the work could advance LLM-based AHD by moving beyond passive generation in fixed workflows to an active, tool-using agent trained with RL. The compact model size combined with reduced evaluations and apparent generalization to held-out tasks would indicate a practical path toward more efficient autonomous heuristic discovery. The environment synthesis pipeline for creating RL training signals is a potentially valuable technical contribution if it demonstrably supports out-of-distribution performance.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that the 4B-parameter agent 'matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations' across eight domains including four held-out tasks. However, it supplies no details on the baselines, metrics, number of independent runs, statistical tests, ablation studies, or experimental protocols. This prevents assessment of whether the data support the central performance claim.
  2. [Experiments section] Environment synthesis pipeline and held-out tasks (Experiments section): The generalizability claim to four held-out tasks is load-bearing and depends on the novel environment synthesis pipeline producing training signals that enable out-of-distribution AHD decision-making. The manuscript must provide explicit distribution-shift metrics or diversity controls demonstrating that synthesized environments differ structurally from the held-out set (e.g., in problem classes, constraint types, or instance distributions); absent this, the reported performance could reflect training-distribution overlap rather than the agentic multi-turn framework.
minor comments (1)
  1. [Abstract] The abstract uses 'significantly fewer evaluations' without any quantification or comparison numbers; adding approximate ratios or absolute figures would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that the 4B-parameter agent 'matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations' across eight domains including four held-out tasks. However, it supplies no details on the baselines, metrics, number of independent runs, statistical tests, ablation studies, or experimental protocols. This prevents assessment of whether the data support the central performance claim.

    Authors: We agree that the abstract is high-level and omits specific experimental details due to length constraints. In the revised manuscript, we will expand the abstract to briefly name the primary baselines (including model sizes), the main metrics (solution quality and evaluation counts), and note that results are reported as averages over multiple independent runs with statistical tests detailed in the Experiments section. Full protocols, ablations, and significance results remain in the main text. This change directly addresses the concern while preserving abstract conciseness. revision: yes

  2. Referee: [Experiments section] Environment synthesis pipeline and held-out tasks (Experiments section): The generalizability claim to four held-out tasks is load-bearing and depends on the novel environment synthesis pipeline producing training signals that enable out-of-distribution AHD decision-making. The manuscript must provide explicit distribution-shift metrics or diversity controls demonstrating that synthesized environments differ structurally from the held-out set (e.g., in problem classes, constraint types, or instance distributions); absent this, the reported performance could reflect training-distribution overlap rather than the agentic multi-turn framework.

    Authors: We acknowledge the need for explicit evidence of distribution shift to support the held-out task claims. The environment synthesis pipeline (Section 3.2) generates training environments by varying problem parameters, constraint structures, and instance features across domains. In the revision, we add quantitative distribution-shift analyses, including comparisons of problem classes, constraint types, and instance distributions (via new tables and divergence metrics) between synthesized training data and the four held-out tasks. These demonstrate structural differences and reinforce that gains arise from the agentic RL framework. An ablation on the pipeline's role in generalization is also included. revision: yes
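
One concrete shape the promised divergence analysis could take: histogram an instance-level feature for the synthesized training environments and for each held-out task, then report a symmetric divergence. The sketch below uses SciPy's Jensen-Shannon distance; the feature choice and binning are illustrative assumptions, not the authors' protocol.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def feature_histogram(instances, feature_fn, bins):
    values = np.concatenate([feature_fn(inst) for inst in instances])
    hist, _ = np.histogram(values, bins=bins)
    hist = hist.astype(float) + 1e-12       # avoid empty bins
    return hist / hist.sum()

def shift_score(train_instances, heldout_instances, feature_fn,
                bins=np.linspace(0.0, 1.0, 51)):
    """Jensen-Shannon distance between feature distributions:
    0 means identical; 1 (in base 2) means disjoint support."""
    p = feature_histogram(train_instances, feature_fn, bins)
    q = feature_histogram(heldout_instances, feature_fn, bins)
    return jensenshannon(p, q, base=2)

# Example feature: normalized pairwise distances of a 2-D point instance.
def pairwise_distance_features(points):
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return (d / (d.max() + 1e-12))[np.triu_indices(len(pts), k=1)]
```

A near-zero distance on every feature would support the referee's overlap worry; consistently large distances would back the rebuttal.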

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation of a novel pipeline, not self-referential definitions or fits

full rationale

The paper presents an empirical method (AHD Agent) whose central claims are performance outcomes measured on eight domains including four held-out tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described methodology. The environment synthesis pipeline is introduced as an external innovation whose value is tested by downstream RL training and generalization metrics rather than being defined in terms of the target results. No load-bearing step reduces a prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified effectiveness of the environment synthesis pipeline and the assumption that RL can instill generalizable tool-use policies in LLMs for heuristic design.

axioms (1)
  • domain assumption LLMs can learn effective dynamic decision policies for tool invocation versus generation through reinforcement learning on synthesized environments.
    This underpins the agentic RL training system introduced to optimize the AHD agent.
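
What the axiom asks of the training loop can be made concrete. Below is a hedged sketch of a trajectory-level policy-gradient update in which an entire multi-turn session earns one scalar reward (the best objective it found), with a leave-one-out baseline in the spirit of the policy-gradient methods the paper cites [45, 47]; the reward definition and baseline are assumptions, not the paper's algorithm.

```python
import torch

def reinforce_loss(log_probs_per_traj, rewards):
    """Trajectory-level policy gradient: each multi-turn session earns one
    scalar reward, shared across all of its turns.
    log_probs_per_traj: list of 1-D tensors (per-turn action log-probs);
    rewards: 1-D tensor with one reward per sampled session."""
    rewards = rewards.float()
    n = rewards.numel()
    # Leave-one-out baseline: judge each session against the mean reward
    # of the other sessions sampled for the same environment.
    baseline = (rewards.sum() - rewards) / max(n - 1, 1)
    advantages = (rewards - baseline).detach()
    loss = torch.zeros(())
    for logp, adv in zip(log_probs_per_traj, advantages):
        loss = loss - adv * logp.sum()
    return loss / n
```

Whether such a sparse, session-level signal suffices to shape when the policy calls tools is exactly what the axiom asserts, and what the training curves in Figures 5 and 9 are meant to evidence.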

pith-pipeline@v0.9.0 · 5529 in / 1259 out tokens · 62717 ms · 2026-05-12T01:10:37.206351+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 7 internal anchors

  1. [1] R. Matai, S. P. Singh, and M. L. Mittal, "Traveling salesman problem: an overview of applications, formulations, and solution approaches," Traveling Salesman Problem, Theory and Applications, vol. 1, no. 1, pp. 1–25, 2010.

  2. [2] C. Rajendran, "Heuristic algorithm for scheduling in a flowshop to minimize total flowtime," International Journal of Production Economics, vol. 29, no. 1, pp. 65–73, 1993.

  3. [3] S. Desale, A. Rasool, S. Andhale, and P. Rane, "Heuristic and meta-heuristic algorithms and their relevance to the real world: a survey," Int. J. Comput. Eng. Res. Trends, vol. 351, no. 5, pp. 2349–7084, 2015.

  4. [4] E. K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J. R. Woodward, "A classification of hyper-heuristic approaches," in Handbook of Metaheuristics. Springer, 2010, pp. 449–468.

  5. [5] W. B. Langdon and R. Poli, Foundations of Genetic Programming. Springer, 2002, vol. 90.

  6. [6] Y. Mei, Q. Chen, A. Lensen, B. Xue, and M. Zhang, "Explainable artificial intelligence by genetic programming: A survey," IEEE Transactions on Evolutionary Computation, vol. 27, no. 3, pp. 621–641, 2022.

  7. [7] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi et al., "Mathematical discoveries from program search with large language models," Nature, vol. 625, no. 7995, pp. 468–475, 2024.

  8. [8] A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian et al., "AlphaEvolve: A coding agent for scientific and algorithmic discovery," arXiv preprint arXiv:2506.13131, 2025.

  9. [9] F. Liu, X. Tong, M. Yuan, X. Lin, F. Luo, Z. Wang, Z. Lu, and Q. Zhang, "Evolution of heuristics: Towards efficient automatic algorithm design using large language model," arXiv preprint arXiv:2401.02051, 2024.

  10. [10] H. Ye, J. Wang, Z. Cao, F. Berto, C. Hua, H. Kim, J. Park, and G. Song, "ReEvo: Large language models as hyper-heuristics with reflective evolution," Advances in Neural Information Processing Systems, vol. 37, pp. 43571–43608, 2024.

  11. [11] DeepSeek-AI, "DeepSeek-V4: Towards highly efficient million-token context intelligence," 2026.

  12. [12] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, K. Yu, M. N. Nguyen, L. Liu, E. Gottlieb et al., "RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning," arXiv preprint arXiv:2504.20073, 2025.

  13. [13] Z. Zheng, Z. Xie, Z. Wang, and B. Hooi, "Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design," arXiv preprint arXiv:2501.08603, 2025.

  14. [14] F. Liu, Y. Liu, Q. Zhang, T. Xialiang, and M. Yuan, "EoH-S: Evolution of heuristic set using LLMs for automated heuristic design," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 43, 2026, pp. 37090–37098.

  15. [15] P. V. T. Dat, L. Doan, and H. T. T. Binh, "HSEvo: Elevating automatic heuristic design with diversity-driven harmony search and genetic algorithm using LLMs," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 25, 2025, pp. 26931–26938.

  16. [16] Y. Shi, J. Zhou, W. Song, J. Bi, Y. Wu, Z. Cao, and J. Zhang, "Generalizable heuristic generation through LLMs with meta-optimization," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=tIQZ7pVN6S

  17. [17] A. Hottung, F. Berto, C. Hua, N. G. Zepeda, D. Wetzel, M. Römer, H. Ye, D. Zago, M. Poli, S. Massaroli et al., "VRPAgent: LLM-driven discovery of heuristic operators for vehicle routing problems," arXiv preprint arXiv:2510.07073, 2025.

  18. [18] R. Li, L. Wang, H. Sang, L. Yao, and L. Pan, "LLM-assisted automatic memetic algorithm for lot-streaming hybrid job shop scheduling with variable sublots," IEEE Transactions on Evolutionary Computation, 2025.

  19. [19] Z. Zhang, S. Li, C. Li, F. Liu, M. Chen, K. Li, T. Zhong, B. An, and P. Liu, "DHEvo: Data-algorithm based heuristic evolution for generalizable MILP solving," arXiv preprint arXiv:2507.15615, 2025.

  20. [20] M. Chen and G. Li, "Dasathco: Data-aware SAT heuristics combinations optimization via large language models," arXiv preprint arXiv:2509.12602, 2025.

  21. [21] S. Zhang, S. Liu, N. Lu, J. Wu, J. Liu, Y.-S. Ong, and K. Tang, "LLM-driven instance-specific heuristic generation and selection," arXiv preprint arXiv:2506.00490, 2026.

  22. [22] A. Surina, A. Mansouri, L. Quaedvlieg, A. Seddas, M. Viazovska, E. Abbe, and C. Gulcehre, "Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning," arXiv preprint arXiv:2504.05108, 2025.

  23. [23] R. Zhu, C. Zhang, and Z. Cao, "Refining hybrid genetic search for CVRP via reinforcement learning-finetuned LLM," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=aITKXFeivk

  24. [24] Z. Huang, W. Wu, K. Wu, W.-B. Lee, and J. Wang, "CALM: Co-evolution of algorithms and language model for automatic heuristic design," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=x6bG2Hoqdf

  25. [25] C. Jiang, X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, and Y. Yu, "LLMOPT: Learning to define and solve general optimization problems from scratch," in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 2025.

  26. [26] S. Liu, C. Chen, X. Qu, K. Tang, and Y.-S. Ong, "Large language models as evolutionary optimizers," in 2024 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2024, pp. 1–8.

  27. [27] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, "CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13643–13658.

  28. [28] Z. Zhang and A. Zhang, "You only look at screens: Multimodal chain-of-action agents," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3132–3149.

  29. [29] S. Hu, M. Ouyang, D. Gao, and M. Z. Shou, "The dawn of GUI agent: A preliminary case study with Claude 3.5 Computer Use," arXiv preprint arXiv:2411.10323, 2024.

  30. [30] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=ehfRiF0R3a

  31. [31] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183.

  32. [32] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

  33. [33] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.

  34. [34] J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang, "Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration," Advances in Neural Information Processing Systems, vol. 37, pp. 2686–2710, 2024.

  35. [35] W. Tan, W. Zhang, X. Xu, H. Xia, G. Ding, B. Li, B. Zhou, J. Yue, J. Jiang, Y. Li et al., "Cradle: Empowering foundation agents towards general computer control," in NeurIPS 2024 Workshop on Open-World Agents, 2024.

  36. [36] N. Lu, S. Liu, R. He, Y. Ong, Q. Wang, and K. Tang, "Large language models can be guided to evade AI-generated text detection," TMLR, 2024.

  37. [37] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.

  38. [38] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, "OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments," in The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

  39. [39] H. Jiang and K. Tang, "Why agents compromise safety under pressure," arXiv preprint arXiv:2603.14975, 2026.

  40. [40] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, "Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning," arXiv preprint arXiv:2503.09516, 2025.

  41. [41] C. Baronio, P. Marsella, B. Pan, S. Guo, and S. Alberti, "Kevin: Multi-turn RL for generating CUDA kernels," in The Fourteenth International Conference on Learning Representations, 2026.

  42. [42] Z. Feng, Q. Chen, N. Lu, Y. Li, S. Cheng, S. Peng, D. Tang, S. Liu, and Z. Zhang, "Is PRM necessary? Problem-solving RL implicitly induces PRM capability in LLMs," in NeurIPS, 2025.

  43. [43] J. Wu, N. Lu, S. Liu, K. Wang, Y. Yang, L. Qing, and K. Tang, "Train at moving edge: Online-verified prompt selection for efficient RL training of large reasoning model," arXiv preprint arXiv:2603.25184, 2026.

  44. [44] Y. Gou, K. Chen, Z. Liu, L. Hong, X. Jin, Z. Li, J. Kwok, and Y. Zhang, "Reasoning-aligned perception decoupling for scalable multi-modal reasoning," in The Fourteenth International Conference on Learning Representations, 2026.

  45. [45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

  46. [46] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.

  47. [47] W. Kool, H. van Hoof, and M. Welling, "Buy 4 REINFORCE samples, get a baseline for free!" in ICLR 2019 Workshop, 2019.

  48. [48] E. L. Lawler, "The traveling salesman problem: a guided tour of combinatorial optimization," Wiley-Interscience Series in Discrete Mathematics, 1985.

  49. [49] F. Liu, R. Zhang, Z. Xie, R. Sun, K. Li, Q. Hu, P. Guo, X. Lin, X. Tong, M. Yuan et al., "LLM4AD: A platform for algorithm design with large language model," arXiv preprint arXiv:2412.17287, 2024.

  50. [50] M. Dorigo, M. Birattari, and T. Stützle, "Ant colony optimization," IEEE Computational Intelligence Magazine, vol. 1, no. 4, pp. 28–39, 2006.

  51. [51] Y. Yao, F. Liu, J. Cheng, and Q. Zhang, "Evolve cost-aware acquisition functions using large language models," in International Conference on Parallel Problem Solving from Nature. Springer, 2024, pp. 374–390.

  52. [52] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, "An analysis of several heuristics for the traveling salesman problem," SIAM Journal on Computing, vol. 6, no. 3, pp. 563–581, 1977.

  53. [53] R. Skinderowicz, "Improving ant colony optimization efficiency for solving large TSP instances," Applied Soft Computing, vol. 120, p. 108653, 2022.

  54. [54] J. Cai, P. Wang, S. Sun, and H. Dong, "A dynamic space reduction ant colony optimization for capacitated vehicle routing problem," Soft Computing, vol. 26, no. 17, pp. 8745–8756, 2022.

  55. [55] S. Sohrabi, K. Ziarati, and M. Keshtkaran, "ACS-OPHS: Ant colony system for the orienteering problem with hotel selection," EURO Journal on Transportation and Logistics, vol. 10, p. 100036, 2021.

  56. [56] S. Fidanova, "Hybrid ant colony optimization algorithm for multiple knapsack problem," in 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE). IEEE, 2020, pp. 1–5.

  57. [57] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.

  58. [58] Qwen Team, "Qwen3 technical report," 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  59. [59] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015.

  60. [60] J. Močkus, "On Bayesian methods for seeking the extremum," in IFIP Technical Conference on Optimization Techniques. Springer, 1974, pp. 400–404.

  61. [61] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," Advances in Neural Information Processing Systems, vol. 25, 2012.

  62. [62] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.

  63. [63] K. Helsgaun, "An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems," Roskilde: Roskilde University, vol. 12, pp. 966–980, 2017.

  64. [64] N. A. Wouda, L. Lan, and W. Kool, "PyVRP: A high-performance VRP solver package," INFORMS Journal on Computing, vol. 36, no. 4, pp. 943–955, 2024.

  65. [65] G. Kobeaga, J. Rojas-Delgado, M. Merino, and J. A. Lozano, "A revisited branch-and-cut algorithm for large-scale orienteering problems," European Journal of Operational Research, vol. 313, no. 1, pp. 44–68, 2024.

  66. [66] L. Perron and V. Furnon, "OR-Tools," Google. [Online]. Available: https://developers.google.com/optimization/
