pith. machine review for the scientific record.

arxiv: 2605.08756 · v1 · submitted 2026-05-09 · 💻 cs.AI · cs.NE

Recognition: 2 theorem links · Lean Theorem

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Haoze Lv, Ning Lu, Shengcai Liu, Ziang Zhou

Pith reviewed 2026-05-12 01:10 UTC · model grok-4.3

classification 💻 cs.AI cs.NE
keywords automatic heuristic design · agentic reinforcement learning · LLM agents · combinatorial optimization · tool-integrated agents · heuristic discovery · multi-turn decision making

The pith

A 4B-parameter agent trained with agentic reinforcement learning matches or surpasses larger models in automatic heuristic design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AHD Agent as a framework that lets language models act as agents deciding dynamically whether to output a heuristic or call tools to pull specific evidence from the current solving state. Training happens through reinforcement learning on environments built by a new synthesis pipeline that supplies the signals needed for good decisions. The result is a compact 4B model that equals or beats existing methods built around much larger models while using far fewer evaluations. This holds across eight different combinatorial optimization domains, four of which the model never encountered during training. A reader would care because the work points to a practical route for making heuristic discovery less dependent on massive models and more autonomous.

Core claim

Framing automatic heuristic design as a multi-turn agentic process, in which the model chooses at each step to generate a heuristic or invoke tools for targeted evidence, and then training the resulting policy with reinforcement learning on environments produced by a novel synthesis pipeline, yields a 4B-parameter model that matches or exceeds state-of-the-art baselines built on substantially larger models while requiring significantly fewer evaluations on both training and held-out tasks.

What carries the argument

The AHD Agent: a tool-integrated multi-turn framework in which the language model proactively chooses between heuristic generation and tool invocation to retrieve state-dependent evidence.
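
The turn structure is the crux, so a sketch helps. Below is a minimal Python rendering of that loop as the paper describes it at a high level; `llm_decide`, `call_tool`, and `evaluate` are hypothetical stand-ins for the policy, the tool set, and the heuristic evaluator, and the action schema and stopping rule are assumptions, not the paper's interface.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                       # "tool_call", "generate", or "submit"
    tool_name: str = ""
    args: dict = field(default_factory=dict)
    heuristic: str = ""             # candidate heuristic source code

def ahd_agent_loop(problem, seed_heuristic, tools,
                   llm_decide, call_tool, evaluate, max_turns=30):
    """Hypothetical multi-turn loop: at each turn the policy either pulls
    state-dependent evidence via a tool or spends an evaluator call on a
    new heuristic. Assumes larger evaluation scores are better."""
    history = [("problem", problem), ("seed", seed_heuristic)]
    best_heuristic, best_score = seed_heuristic, evaluate(seed_heuristic)

    for _ in range(max_turns):
        action = llm_decide(history, tools)  # policy sees full session history
        if action.kind == "tool_call":
            evidence = call_tool(action.tool_name, action.args)
            history.append(("tool_result", action.tool_name, evidence))
        elif action.kind == "generate":
            score = evaluate(action.heuristic)   # one evaluator call
            history.append(("evaluation", action.heuristic, score))
            if score > best_score:
                best_heuristic, best_score = action.heuristic, score
        else:                                    # "submit": agent stops early
            break
    return best_heuristic, best_score
```

The sketch makes the efficiency argument visible: tool calls and evaluations draw on the same turn budget, so a trained policy only pays for evidence when it expects the evidence to improve the next candidate.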

If this is right

  • Compact models become sufficient for competitive automatic heuristic design.
  • Proactive tool use reduces the number of trials needed compared with passive generation.
  • Performance extends to tasks absent from the reinforcement learning phase.
  • The approach supplies a concrete trajectory toward fully autonomous heuristic discovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same synthesis-plus-RL pattern could be tested for training agents in other domains that require adaptive tool selection.
  • Lower evaluation counts may allow the method to scale to larger problem instances than current baselines can handle.
  • If the decision policy proves robust, future versions could reduce reliance on hand-designed prompts in heuristic search systems.

Load-bearing premise

The environment synthesis pipeline generates training signals that produce decision-making skills transferable to optimization tasks the agent has not seen during reinforcement learning.
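
The review only sees the pipeline's outputs, so as a reading aid, here is a hypothetical shape a synthesized training environment could take; every field, parameter range, and helper name below is an illustrative assumption, not the schema from the paper's Section 3.2.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class SynthesizedEnv:
    domain: str                       # e.g., "TSP" or "CVRP"
    instance_params: dict             # varied problem parameters
    seed_heuristic: str               # starting heuristic source code
    evaluate: Callable[[str], float]  # heuristic code -> reward signal

def synthesize_envs(domains, heuristic_pool, make_evaluator, n_envs, seed=0):
    """Vary domain, instance parameters, and seed heuristic so the agent
    must adapt its tool-use decisions instead of memorizing one task."""
    rng = random.Random(seed)
    envs = []
    for _ in range(n_envs):
        domain = rng.choice(domains)
        params = {"size": rng.choice([50, 100, 200]),
                  "clustered": rng.random() < 0.5}
        envs.append(SynthesizedEnv(
            domain=domain,
            instance_params=params,
            seed_heuristic=rng.choice(heuristic_pool[domain]),
            evaluate=make_evaluator(domain, params),
        ))
    return envs
```

If the premise holds, diversity along axes like these is what teaches the decision policy, rather than any single domain's structure.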

What would settle it

Showing that the 4B agent needs more evaluations or performs worse than larger baselines on a fresh collection of held-out combinatorial optimization tasks would falsify the claim of generalizable performance.
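
Operationally, that is a matched-budget comparison: run every method on the same fresh held-out tasks under an identical evaluator-call cap and compare best-so-far objectives. A minimal sketch, where `run_method` is a hypothetical adapter that yields `(evaluator_calls_used, objective)` pairs for any AHD system:

```python
def compare_at_budget(methods, tasks, budget=100):
    """methods: name -> adapter; each adapter(task) yields
    (evaluator_calls_used, objective) pairs. Assumes minimization."""
    results = {}
    for name, run_method in methods.items():
        per_task_best = []
        for task in tasks:
            best = float("inf")
            for calls_used, objective in run_method(task):
                if calls_used > budget:
                    break              # identical budget for every method
                best = min(best, objective)
            per_task_best.append(best)
        results[name] = sum(per_task_best) / len(per_task_best)
    return results
```

If the 4B agent's row loses to the larger-model baselines at equal budget on genuinely fresh tasks, the generalization claim fails.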

Figures

Figures reproduced from arXiv: 2605.08756 by Haoze Lv, Ning Lu, Shengcai Liu, Ziang Zhou.

Figure 1. Traditional LLM-based AHD vs. AHD Agent. Traditional AHD places the LLM inside a fixed loop; AHD Agent enables the LLM to design heuristics by actively calling tools, generating candidates, and performing evaluations.

Figure 2. Simply providing all available information (tools) to LLMs within fixed workflows brings limited gains and may even hurt performance, suggesting that the key challenge is not information availability alone, but the lack of state-dependent mechanisms for acquiring and using relevant information.

Figure 3. Demonstration of the AHD Agent workflow. Given a problem description, a seed heuristic, and a set of tools, the model iteratively decides its next action based on the session history, which records all previous interactions. At each turn, it can call tools, generate and evaluate heuristics, and finally return the best heuristic.

Figure 4. Scaling effect of AHD Agent. (a) Inference-time scaling comparison: the SR strategy outperforms the PS strategy on two tasks. (b) Model scaling favors AHD Agent: performance increases as model size grows from 30B to 397B parameters. (Panels plot best-so-far objective against evaluator calls on CVRP-ACO and OP-ACO for EoH, ReEvo, MCTS-AHD, and AHD Agent.)

Figure 5. Training curves during design. AHD Agent converges faster and achieves better performance under larger evaluation budgets. (Panels plot reward and number of turns against RL update step.)

Figure 7. Performance changes as the number of training domains increases. Cross-domain RL training increases the general AHD capability.

Figure 8. Objective-value distribution of the source heuristic pools used to generate the RL training …

Figure 9. RL training diagnostics over 500 steps. Top-left: quality reward (higher is better). …

Figure 10. Per-domain train-side validation curves over RL training steps. Each panel reports the …

Figure 11. Mean validation Gap (%) across the three reported problem sizes as the GPT-series …
Original abstract

Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AHD Agent, a tool-integrated multi-turn framework for automatic heuristic design (AHD) in NP-hard combinatorial optimization problems. The LLM is trained via agentic reinforcement learning on environments produced by a novel synthesis pipeline, enabling it to decide proactively between generating heuristics and invoking tools to retrieve targeted evidence from the solving environment. The central claim is that the resulting 4B-parameter agent matches or surpasses state-of-the-art baselines (built on much larger models) across eight diverse domains, including four held-out tasks, while requiring significantly fewer evaluations.

Significance. If the empirical results hold under rigorous scrutiny, the work could advance LLM-based AHD by moving beyond passive generation in fixed workflows to an active, tool-using agent trained with RL. The compact model size combined with reduced evaluations and apparent generalization to held-out tasks would indicate a practical path toward more efficient autonomous heuristic discovery. The environment synthesis pipeline for creating RL training signals is a potentially valuable technical contribution if it demonstrably supports out-of-distribution performance.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that the 4B-parameter agent 'matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations' across eight domains including four held-out tasks. However, it supplies no details on the baselines, metrics, number of independent runs, statistical tests, ablation studies, or experimental protocols. This prevents assessment of whether the data support the central performance claim.
  2. [Experiments section] Environment synthesis pipeline and held-out tasks (Experiments section): The generalizability claim to four held-out tasks is load-bearing and depends on the novel environment synthesis pipeline producing training signals that enable out-of-distribution AHD decision-making. The manuscript must provide explicit distribution-shift metrics or diversity controls demonstrating that synthesized environments differ structurally from the held-out set (e.g., in problem classes, constraint types, or instance distributions); absent this, the reported performance could reflect training-distribution overlap rather than the agentic multi-turn framework.
minor comments (1)
  1. [Abstract] The abstract uses 'significantly fewer evaluations' without any quantification or comparison numbers; adding approximate ratios or absolute figures would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that the 4B-parameter agent 'matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations' across eight domains including four held-out tasks. However, it supplies no details on the baselines, metrics, number of independent runs, statistical tests, ablation studies, or experimental protocols. This prevents assessment of whether the data support the central performance claim.

    Authors: We agree that the abstract is high-level and omits specific experimental details due to length constraints. In the revised manuscript, we will expand the abstract to briefly name the primary baselines (including model sizes), the main metrics (solution quality and evaluation counts), and note that results are reported as averages over multiple independent runs with statistical tests detailed in the Experiments section. Full protocols, ablations, and significance results remain in the main text. This change directly addresses the concern while preserving abstract conciseness. revision: yes

  2. Referee: [Experiments section] Environment synthesis pipeline and held-out tasks (Experiments section): The generalizability claim to four held-out tasks is load-bearing and depends on the novel environment synthesis pipeline producing training signals that enable out-of-distribution AHD decision-making. The manuscript must provide explicit distribution-shift metrics or diversity controls demonstrating that synthesized environments differ structurally from the held-out set (e.g., in problem classes, constraint types, or instance distributions); absent this, the reported performance could reflect training-distribution overlap rather than the agentic multi-turn framework.

    Authors: We acknowledge the need for explicit evidence of distribution shift to support the held-out task claims. The environment synthesis pipeline (Section 3.2) generates training environments by varying problem parameters, constraint structures, and instance features across domains. In the revision, we add quantitative distribution-shift analyses, including comparisons of problem classes, constraint types, and instance distributions (via new tables and divergence metrics) between synthesized training data and the four held-out tasks. These demonstrate structural differences and reinforce that gains arise from the agentic RL framework. An ablation on the pipeline's role in generalization is also included. revision: yes
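
One concrete shape the promised divergence analysis could take: histogram an instance-level feature for the synthesized training environments and for each held-out task, then report a symmetric divergence. The sketch below uses SciPy's Jensen-Shannon distance; the feature choice and binning are illustrative assumptions, not the authors' protocol.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def feature_histogram(instances, feature_fn, bins):
    values = np.concatenate([feature_fn(inst) for inst in instances])
    hist, _ = np.histogram(values, bins=bins)
    hist = hist.astype(float) + 1e-12       # avoid empty bins
    return hist / hist.sum()

def shift_score(train_instances, heldout_instances, feature_fn,
                bins=np.linspace(0.0, 1.0, 51)):
    """Jensen-Shannon distance between feature distributions:
    0 means identical; 1 (in base 2) means disjoint support."""
    p = feature_histogram(train_instances, feature_fn, bins)
    q = feature_histogram(heldout_instances, feature_fn, bins)
    return jensenshannon(p, q, base=2)

# Example feature: normalized pairwise distances of a 2-D point instance.
def pairwise_distance_features(points):
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return (d / (d.max() + 1e-12))[np.triu_indices(len(pts), k=1)]
```

A near-zero distance on every feature would support the referee's overlap worry; consistently large distances would back the rebuttal.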

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation of a novel pipeline, not self-referential definitions or fits

full rationale

The paper presents an empirical method (AHD Agent) whose central claims are performance outcomes measured on eight domains including four held-out tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described methodology. The environment synthesis pipeline is introduced as an external innovation whose value is tested by downstream RL training and generalization metrics rather than being defined in terms of the target results. No load-bearing step reduces a prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified effectiveness of the environment synthesis pipeline and the assumption that RL can instill generalizable tool-use policies in LLMs for heuristic design.

axioms (1)
  • domain assumption LLMs can learn effective dynamic decision policies for tool invocation versus generation through reinforcement learning on synthesized environments.
    This underpins the agentic RL training system introduced to optimize the AHD agent.
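
What the axiom asks of the training loop can be made concrete. Below is a hedged sketch of a trajectory-level policy-gradient update in which an entire multi-turn session earns one scalar reward (the best objective it found), with a leave-one-out baseline in the spirit of the policy-gradient methods the paper cites [45, 47]; the reward definition and baseline are assumptions, not the paper's algorithm.

```python
import torch

def reinforce_loss(log_probs_per_traj, rewards):
    """Trajectory-level policy gradient: each multi-turn session earns one
    scalar reward, shared across all of its turns.
    log_probs_per_traj: list of 1-D tensors (per-turn action log-probs);
    rewards: 1-D tensor with one reward per sampled session."""
    rewards = rewards.float()
    n = rewards.numel()
    # Leave-one-out baseline: judge each session against the mean reward
    # of the other sessions sampled for the same environment.
    baseline = (rewards.sum() - rewards) / max(n - 1, 1)
    advantages = (rewards - baseline).detach()
    loss = torch.zeros(())
    for logp, adv in zip(log_probs_per_traj, advantages):
        loss = loss - adv * logp.sum()
    return loss / n
```

Whether such a sparse, session-level signal suffices to shape when the policy calls tools is exactly what the axiom asserts, and what the training curves in Figures 5 and 9 are meant to evidence.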

pith-pipeline@v0.9.0 · 5529 in / 1259 out tokens · 62717 ms · 2026-05-12T01:10:37.206351+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 7 internal anchors

  1. [1] R. Matai, S. P. Singh, and M. L. Mittal, "Traveling salesman problem: an overview of applications, formulations, and solution approaches," Traveling Salesman Problem, Theory and Applications, vol. 1, no. 1, pp. 1–25, 2010.

  2. [2] C. Rajendran, "Heuristic algorithm for scheduling in a flowshop to minimize total flowtime," International Journal of Production Economics, vol. 29, no. 1, pp. 65–73, 1993.

  3. [3] S. Desale, A. Rasool, S. Andhale, and P. Rane, "Heuristic and meta-heuristic algorithms and their relevance to the real world: a survey," Int. J. Comput. Eng. Res. Trends, vol. 351, no. 5, pp. 2349–7084, 2015.

  4. [4] E. K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J. R. Woodward, "A classification of hyper-heuristic approaches," in Handbook of Metaheuristics. Springer, 2010, pp. 449–468.

  5. [5] W. B. Langdon and R. Poli, Foundations of Genetic Programming. Springer, 2002, vol. 90.

  6. [6] Y. Mei, Q. Chen, A. Lensen, B. Xue, and M. Zhang, "Explainable artificial intelligence by genetic programming: A survey," IEEE Transactions on Evolutionary Computation, vol. 27, no. 3, pp. 621–641, 2022.

  7. [7] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi et al., "Mathematical discoveries from program search with large language models," Nature, vol. 625, no. 7995, pp. 468–475, 2024.

  8. [8] A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian et al., "AlphaEvolve: A coding agent for scientific and algorithmic discovery," arXiv preprint arXiv:2506.13131, 2025.

  9. [9] F. Liu, X. Tong, M. Yuan, X. Lin, F. Luo, Z. Wang, Z. Lu, and Q. Zhang, "Evolution of heuristics: Towards efficient automatic algorithm design using large language model," arXiv preprint arXiv:2401.02051, 2024.

  10. [10] H. Ye, J. Wang, Z. Cao, F. Berto, C. Hua, H. Kim, J. Park, and G. Song, "ReEvo: Large language models as hyper-heuristics with reflective evolution," Advances in Neural Information Processing Systems, vol. 37, pp. 43571–43608, 2024.

  11. [11] DeepSeek-AI, "DeepSeek-V4: Towards highly efficient million-token context intelligence," 2026.

  12. [12] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, K. Yu, M. N. Nguyen, L. Liu, E. Gottlieb et al., "RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning," arXiv preprint arXiv:2504.20073, 2025.

  13. [13] Z. Zheng, Z. Xie, Z. Wang, and B. Hooi, "Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design," arXiv preprint arXiv:2501.08603, 2025.

  14. [14] F. Liu, Y. Liu, Q. Zhang, T. Xialiang, and M. Yuan, "EoH-S: Evolution of heuristic set using LLMs for automated heuristic design," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 43, 2026, pp. 37090–37098.

  15. [15] P. V. T. Dat, L. Doan, and H. T. T. Binh, "HSEvo: Elevating automatic heuristic design with diversity-driven harmony search and genetic algorithm using LLMs," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 25, 2025, pp. 26931–26938.

  16. [16] Y. Shi, J. Zhou, W. Song, J. Bi, Y. Wu, Z. Cao, and J. Zhang, "Generalizable heuristic generation through LLMs with meta-optimization," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=tIQZ7pVN6S

  17. [17] A. Hottung, F. Berto, C. Hua, N. G. Zepeda, D. Wetzel, M. Römer, H. Ye, D. Zago, M. Poli, S. Massaroli et al., "VRPAgent: LLM-driven discovery of heuristic operators for vehicle routing problems," arXiv preprint arXiv:2510.07073, 2025.

  18. [18] R. Li, L. Wang, H. Sang, L. Yao, and L. Pan, "LLM-assisted automatic memetic algorithm for lot-streaming hybrid job shop scheduling with variable sublots," IEEE Transactions on Evolutionary Computation, 2025.

  19. [19] Z. Zhang, S. Li, C. Li, F. Liu, M. Chen, K. Li, T. Zhong, B. An, and P. Liu, "DHEvo: Data-algorithm based heuristic evolution for generalizable MILP solving," arXiv preprint arXiv:2507.15615, 2025.

  20. [20] M. Chen and G. Li, "Dasathco: Data-aware SAT heuristics combinations optimization via large language models," arXiv preprint arXiv:2509.12602, 2025.

  21. [21] S. Zhang, S. Liu, N. Lu, J. Wu, J. Liu, Y.-S. Ong, and K. Tang, "LLM-driven instance-specific heuristic generation and selection," arXiv preprint arXiv:2506.00490, 2026.

  22. [22] A. Surina, A. Mansouri, L. Quaedvlieg, A. Seddas, M. Viazovska, E. Abbe, and C. Gulcehre, "Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning," arXiv preprint arXiv:2504.05108, 2025.

  23. [23] R. Zhu, C. Zhang, and Z. Cao, "Refining hybrid genetic search for CVRP via reinforcement learning-finetuned LLM," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=aITKXFeivk

  24. [24] Z. Huang, W. Wu, K. Wu, W.-B. Lee, and J. Wang, "CALM: Co-evolution of algorithms and language model for automatic heuristic design," in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=x6bG2Hoqdf

  25. [25] C. Jiang, X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, and Y. Yu, "LLMOPT: Learning to define and solve general optimization problems from scratch," in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 2025.

  26. [26] S. Liu, C. Chen, X. Qu, K. Tang, and Y.-S. Ong, "Large language models as evolutionary optimizers," in 2024 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2024, pp. 1–8.

  27. [27] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, "CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13643–13658.

  28. [28] Z. Zhang and A. Zhang, "You only look at screens: Multimodal chain-of-action agents," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3132–3149.

  29. [29] S. Hu, M. Ouyang, D. Gao, and M. Z. Shou, "The dawn of GUI agent: A preliminary case study with Claude 3.5 Computer Use," arXiv preprint arXiv:2411.10323, 2024.

  30. [30] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=ehfRiF0R3a

  31. [31] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183.

  32. [32] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

  33. [33] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.

  34. [34] J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang, "Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration," Advances in Neural Information Processing Systems, vol. 37, pp. 2686–2710, 2024.

  35. [35] W. Tan, W. Zhang, X. Xu, H. Xia, G. Ding, B. Li, B. Zhou, J. Yue, J. Jiang, Y. Li et al., "Cradle: Empowering foundation agents towards general computer control," in NeurIPS 2024 Workshop on Open-World Agents, 2024.

  36. [36] N. Lu, S. Liu, R. He, Y. Ong, Q. Wang, and K. Tang, "Large language models can be guided to evade AI-generated text detection," TMLR, 2024.

  37. [37] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.

  38. [38] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, "OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments," in The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

  39. [39] H. Jiang and K. Tang, "Why agents compromise safety under pressure," arXiv preprint arXiv:2603.14975, 2026.

  40. [40] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, "Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning," arXiv preprint arXiv:2503.09516, 2025.

  41. [41] C. Baronio, P. Marsella, B. Pan, S. Guo, and S. Alberti, "Kevin: Multi-turn RL for generating CUDA kernels," in The Fourteenth International Conference on Learning Representations, 2026.

  42. [42] Z. Feng, Q. Chen, N. Lu, Y. Li, S. Cheng, S. Peng, D. Tang, S. Liu, and Z. Zhang, "Is PRM necessary? Problem-solving RL implicitly induces PRM capability in LLMs," in NeurIPS, 2025.

  43. [43] J. Wu, N. Lu, S. Liu, K. Wang, Y. Yang, L. Qing, and K. Tang, "Train at moving edge: Online-verified prompt selection for efficient RL training of large reasoning model," arXiv preprint arXiv:2603.25184, 2026.

  44. [44] Y. Gou, K. Chen, Z. Liu, L. Hong, X. Jin, Z. Li, J. Kwok, and Y. Zhang, "Reasoning-aligned perception decoupling for scalable multi-modal reasoning," in The Fourteenth International Conference on Learning Representations, 2026.

  45. [45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

  46. [46] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.

  47. [47] W. Kool, H. van Hoof, and M. Welling, "Buy 4 REINFORCE samples, get a baseline for free!" in ICLR 2019 Workshop, 2019.

  48. [48] E. L. Lawler, "The traveling salesman problem: a guided tour of combinatorial optimization," Wiley-Interscience Series in Discrete Mathematics, 1985.

  49. [49] F. Liu, R. Zhang, Z. Xie, R. Sun, K. Li, Q. Hu, P. Guo, X. Lin, X. Tong, M. Yuan et al., "LLM4AD: A platform for algorithm design with large language model," arXiv preprint arXiv:2412.17287, 2024.

  50. [50] M. Dorigo, M. Birattari, and T. Stützle, "Ant colony optimization," IEEE Computational Intelligence Magazine, vol. 1, no. 4, pp. 28–39, 2006.

  51. [51] Y. Yao, F. Liu, J. Cheng, and Q. Zhang, "Evolve cost-aware acquisition functions using large language models," in International Conference on Parallel Problem Solving from Nature. Springer, 2024, pp. 374–390.

  52. [52] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, "An analysis of several heuristics for the traveling salesman problem," SIAM Journal on Computing, vol. 6, no. 3, pp. 563–581, 1977.

  53. [53] R. Skinderowicz, "Improving ant colony optimization efficiency for solving large TSP instances," Applied Soft Computing, vol. 120, p. 108653, 2022.

  54. [54] J. Cai, P. Wang, S. Sun, and H. Dong, "A dynamic space reduction ant colony optimization for capacitated vehicle routing problem," Soft Computing, vol. 26, no. 17, pp. 8745–8756, 2022.

  55. [55] S. Sohrabi, K. Ziarati, and M. Keshtkaran, "ACS-OPHS: Ant colony system for the orienteering problem with hotel selection," EURO Journal on Transportation and Logistics, vol. 10, p. 100036, 2021.

  56. [56] S. Fidanova, "Hybrid ant colony optimization algorithm for multiple knapsack problem," in 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE). IEEE, 2020, pp. 1–5.

  57. [57] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.

  58. [58] Qwen Team, "Qwen3 technical report," 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  59. [59] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015.

  60. [60] J. Močkus, "On Bayesian methods for seeking the extremum," in IFIP Technical Conference on Optimization Techniques. Springer, 1974, pp. 400–404.

  61. [61] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," Advances in Neural Information Processing Systems, vol. 25, 2012.

  62. [62] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.

  63. [63] K. Helsgaun, "An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems," Roskilde: Roskilde University, vol. 12, pp. 966–980, 2017.

  64. [64] N. A. Wouda, L. Lan, and W. Kool, "PyVRP: A high-performance VRP solver package," INFORMS Journal on Computing, vol. 36, no. 4, pp. 943–955, 2024.

  65. [65] G. Kobeaga, J. Rojas-Delgado, M. Merino, and J. A. Lozano, "A revisited branch-and-cut algorithm for large-scale orienteering problems," European Journal of Operational Research, vol. 313, no. 1, pp. 44–68, 2024.

  66. [66] L. Perron and V. Furnon, "OR-Tools," Google. [Online]. Available: https://developers.google.com/optimization/
