RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Antón Asla Manzárraga

arxiv: 2606.20142 · v1 · pith:7GAWGH2Tnew · submitted 2026-06-18 · 💻 cs.AI · cs.MA

Pith reviewed 2026-06-26 17:42 UTC · model grok-4.3

keywords: reasoning agent · metaheuristic control · operational memory · bounded interventions · policy consolidation · vehicle routing · search behavior · continuous learning

The pith

A reasoning agent can control a metaheuristic's search by observing operational memory, testing bounded interventions, and consolidating policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RACL as a layer that sits above an existing optimizer and lets a reasoning agent manage its internal search behavior. The agent watches operational memory logs, reasons about past runs, creates limited hypotheses for changes, tries those changes under guardrails, measures the results, keeps what works, and explains the choices. The vehicle routing testbed serves only to demonstrate the method; the core contribution is a repeatable process for discovering algorithmic control rules rather than any new solver or domain rules. A reader would care because the approach aims to make metaheuristic behavior adapt continuously and explainably without rewriting the optimizer or embedding extra domain knowledge by hand. Experiments report that RACL matches or beats an operational-memory baseline in every feasible trial and matches or beats a stagnation-triggered baseline in 18 of 21 trials, with a small average cost reduction.

Core claim

RACL places a reasoning agent above an existing optimizer without replacing the optimizer or changing business constraints. The agent controls search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies, and explaining decisions. In the vehicle routing experiments, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL versus STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead.

What carries the argument

The Reasoning-Agent Control Layer, which lets an agent observe operational memory, reason over logs, propose and test bounded interventions on search behavior, and consolidate the resulting policies.
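The loop named above (observe, hypothesize, test under guardrails, consolidate, explain) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ControlLayer`, `Intervention`, the `destroy_fraction` parameter, its bounds, and the rule inside `propose` are all hypothetical stand-ins, and the rule-based `propose` method only occupies the place where a reasoning agent would sit.

```python
from dataclasses import dataclass, field


@dataclass
class Intervention:
    """A bounded change to one search parameter (hypothetical structure)."""
    param: str
    delta: float
    rationale: str  # human-readable explanation kept with the decision


@dataclass
class ControlLayer:
    params: dict                                   # current search parameters
    bounds: dict                                   # guardrails: param -> (low, high)
    memory: list = field(default_factory=list)     # operational memory (run logs)
    policies: list = field(default_factory=list)   # consolidated interventions

    def propose(self):
        """Stand-in for the agent: react when the latest logged cost did not improve."""
        if len(self.memory) >= 2 and self.memory[-1]["cost"] >= self.memory[-2]["cost"]:
            return Intervention("destroy_fraction", 0.05, "latest run did not improve")
        return None

    def apply_guarded(self, iv):
        """Return trial parameters with the change clamped to its guardrail bounds."""
        low, high = self.bounds[iv.param]
        trial = dict(self.params)
        trial[iv.param] = min(high, max(low, trial[iv.param] + iv.delta))
        return trial

    def step(self, run_optimizer):
        """One cycle: run, log, propose, test, and keep the change only if it helps."""
        baseline = run_optimizer(self.params)
        self.memory.append({"params": dict(self.params), "cost": baseline})
        iv = self.propose()
        if iv is None:
            return baseline
        trial_params = self.apply_guarded(iv)
        trial_cost = run_optimizer(trial_params)
        self.memory.append({"params": trial_params, "cost": trial_cost})
        if trial_cost < baseline:          # consolidate only strict improvements
            self.params = trial_params
            self.policies.append(iv)
            return trial_cost
        return baseline                    # otherwise keep the previous settings
```

The optimizer is passed in as a plain callable that maps parameters to a cost, so the layer never touches the solver's internals or its constraints; that mirrors the separation the paper describes, though the real agent's reasoning is far richer than this one-line rule.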

If this is right

RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases.
RACL improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases.
Average RACL versus STP cost delta reaches -0.641%.
In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without material computational overhead.
The method works by having the agent consolidate useful policies after testing interventions under guardrails.
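The headline figures above combine two statistics: a count of cases where RACL improves or ties a baseline, and an average percentage cost delta. The page does not give the exact formulas, so the sketch below assumes the delta is the relative cost difference against the baseline, averaged over feasible cases; the function names and the sample costs are illustrative and are not the paper's data.

```python
def percent_delta(candidate_cost, baseline_cost):
    """Relative cost change in percent; negative means the candidate is cheaper."""
    return 100.0 * (candidate_cost - baseline_cost) / baseline_cost


def summarize(pairs, tol=1e-9):
    """pairs: (candidate_cost, baseline_cost) per feasible case.

    Returns (improves_or_ties, total_cases, mean_percent_delta).
    """
    deltas = [percent_delta(c, b) for c, b in pairs]
    improves_or_ties = sum(1 for d in deltas if d <= tol)
    return improves_or_ties, len(deltas), sum(deltas) / len(deltas)


# Illustrative costs only: one improvement, one tie, one loss.
example_pairs = [(98.0, 100.0), (100.0, 100.0), (101.0, 100.0)]
wins_or_ties, total, mean_delta = summarize(example_pairs)
```

Note that "improves or ties" merges two outcomes, so a high count alone does not say how many cases were strict improvements; reporting wins, ties, and losses separately would make the same table more informative.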

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same observation-and-intervention loop could be applied to metaheuristics other than those used for vehicle routing.
Consolidating policies in an explicit, explainable form may reduce reliance on manually written control heuristics.
Guardrails on interventions could make automated policy changes acceptable in settings where unexplained changes are disallowed.
Replacing the initial reasoning model with a faster one might preserve the same policy quality at lower per-step cost.

Load-bearing premise

That the reasoning agent's observations of operational memory are sufficient to formulate bounded hypotheses whose interventions can be safely tested and whose outcomes reliably indicate useful control policies, without the need for domain-specific routing knowledge beyond what the agent extracts from logs.

What would settle it

A controlled trial in which the agent, given full operational memory logs, repeatedly proposes interventions that produce no improvement or produce degradation, or in which the added reasoning time exceeds any search gains.

Original abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RACL frames a reasoning agent as a control layer for metaheuristics with modest VRP gains but thin experimental documentation.

The core idea is a reasoning agent layered on top of an unmodified metaheuristic that watches operational memory, forms bounded hypotheses about interventions, tests them under guardrails, and consolidates the ones that work into policies. The paper positions this as a general method rather than a new vehicle-routing solver, and the experiments use VRP instances to show RACL tying or beating an operational-memory baseline in all 21 feasible cases and a stagnation-triggered baseline in 18 of 21, with a small average cost improvement.

What stands out is the clean separation between the agent and the optimizer: the agent does not rewrite constraints or the core search logic, and it produces explanations for its choices. Using Codex for live hypothesis generation during the proof-of-concept and then switching to a fixed policy proxy for the quantitative runs is a reasonable way to keep the evaluation reproducible.

The main limitation is thin reporting on the experiments themselves. The abstract states concrete win counts and deltas but gives no information on how the instances were generated, how many independent runs were performed, what variance was observed, or which statistical tests support the claims. Without those details it is difficult to judge whether the reported improvements are stable. A second concern is also left unaddressed: whether the operational-memory logs actually contain enough state for the agent to form useful hypotheses without implicitly importing domain knowledge. The paper does not analyze the information content of the logs or test whether the agent would succeed with deliberately impoverished observations.

This work is aimed at researchers who want to combine language-model reasoning with existing optimization code rather than replace the optimizer. Someone already working on adaptive metaheuristics or LLM-augmented OR tools could extract useful framing from it. The idea is distinct enough and the proof-of-concept concrete enough that it should go to peer review so referees can examine the full experimental protocol and the exact log schema the agent sees.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RACL, a Reasoning-Agent Control Layer placed above an existing metaheuristic optimizer. The agent observes operational memory, reasons over past behavior, formulates bounded hypotheses, tests interventions under guardrails, consolidates policies, and explains decisions, without replacing the optimizer or altering business constraints. Vehicle routing serves as the testbed. The central empirical claim is that RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and a non-reasoning Stagnation-Triggered Policy in 18 of 21 cases, with an average cost delta of -0.641% versus the latter; larger gains (-8.337% vs Fixed, -1.605% vs STP) appear in a Sevilla-9/10 runtime sample. A policy proxy derived from Codex runs is used for reproducible quantitative evaluation.

Significance. If the reported gains prove robust, the work offers a concrete mechanism for inserting reasoning agents into metaheuristic control loops while preserving the underlying solver and emphasizing reproducibility, guardrails, and explainability. The explicit separation of the control layer from domain-specific routing rules and the use of a policy proxy for evaluation are positive design choices that could support broader applicability.

major comments (3)

[Abstract] Abstract: the central claim that RACL improves or ties baselines in 21/21 and 18/21 cases with a -0.641% average delta is presented without any description of instance generation procedure, number of independent runs, variance across runs, or statistical testing; these omissions directly undermine verifiability of the consistency and magnitude of the reported gains.
[Method (RACL mechanism)] Method description of the RACL loop: the statement that the agent 'observes operational memory, reasons over past behavior, formulates bounded hypotheses' does not specify the log schema, fields, or information content made available to the agent. This detail is load-bearing for the claim that interventions are derived solely from logs without implicit importation of VRP-specific structure.
[Evaluation Setup] Evaluation section: the transition from live Codex in-the-loop reasoning to the policy proxy used for all quantitative results is not described, including how hypotheses were consolidated into the proxy and whether the proxy faithfully reproduces the agent's live intervention behavior.
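The second major comment turns on what the agent is actually shown. As an illustration of what an explicit log schema could look like, the sketch below defines a record type and a serializer that exposes only a whitelisted set of solver-agnostic fields. Every name here (`MemoryRecord`, `AGENT_VISIBLE_FIELDS`, `to_agent_view`, and each field) is hypothetical; the paper's real schema is not given on this page.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical whitelist: only these fields are ever passed to the agent.
AGENT_VISIBLE_FIELDS = ("iteration", "operator", "accepted", "cost_before", "cost_after")


@dataclass(frozen=True)
class MemoryRecord:
    """One operational-memory entry (illustrative fields, not the paper's schema)."""
    iteration: int
    operator: str
    accepted: bool
    cost_before: float
    cost_after: float
    route_snapshot: tuple = ()  # domain-specific detail, withheld from the agent


def to_agent_view(records):
    """Serialize records to JSON, keeping only the whitelisted fields."""
    rows = []
    for record in records:
        full = asdict(record)
        rows.append({key: full[key] for key in AGENT_VISIBLE_FIELDS})
    return json.dumps(rows)
```

Publishing a whitelist of this kind would let a reader check directly whether routing-specific structure can reach the agent, and it would also make the impoverished-observation test raised in the editor's note straightforward to run by shrinking the list.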

minor comments (2)

[Abstract / Results] Define the Operational Memory Policy and Stagnation-Triggered Policy explicitly (or cite their sections) when first used in the abstract and results tables.
[Results tables] Tables reporting cost deltas should include per-instance standard deviations or confidence intervals to allow assessment of effect stability.
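The stability reporting requested in these comments can be produced with standard-library tools. The sketch below shows one possible approach, not the paper's protocol: a per-instance mean and sample standard deviation over independent runs, plus an exact two-sided sign test on paired cost deltas (ties dropped). The function names are assumptions, and the sign test is a suggestion here rather than something the paper reports using.

```python
import math
import statistics


def summarize_runs(costs):
    """Mean and sample standard deviation of final costs over independent runs."""
    return {
        "n": len(costs),
        "mean": statistics.mean(costs),
        "stdev": statistics.stdev(costs) if len(costs) > 1 else 0.0,
    }


def sign_test_p(deltas):
    """Exact two-sided sign test on paired deltas; zero deltas (ties) are dropped.

    Under the null hypothesis, wins and losses are equally likely, so the
    smaller count follows a Binomial(n, 0.5) distribution.
    """
    wins = sum(1 for d in deltas if d < 0)    # negative delta = lower cost
    losses = sum(1 for d in deltas if d > 0)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A sign test uses only the direction of each difference, which suits a win/tie/loss table, but it ignores effect size; with magnitudes as small as the reported average delta, per-instance spread is what shows whether the gains exceed run-to-run noise.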

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential value of the RACL approach. We address each major comment below and will revise the manuscript to improve verifiability and methodological transparency.

Referee: [Abstract] Abstract: the central claim that RACL improves or ties baselines in 21/21 and 18/21 cases with a -0.641% average delta is presented without any description of instance generation procedure, number of independent runs, variance across runs, or statistical testing; these omissions directly undermine verifiability of the consistency and magnitude of the reported gains.

Authors: We agree that the abstract should contain sufficient methodological context to support verifiability of the reported claims. In the revised manuscript we will expand the abstract to include a concise statement of the instance generation procedure, the number of independent runs, observed variance, and the statistical testing performed. Full details already appear in the Evaluation section; the abstract revision will make the central claims self-contained. revision: yes

Referee: [Method (RACL mechanism)] Method description of the RACL loop: the statement that the agent 'observes operational memory, reasons over past behavior, formulates bounded hypotheses' does not specify the log schema, fields, or information content made available to the agent. This detail is load-bearing for the claim that interventions are derived solely from logs without implicit importation of VRP-specific structure.

Authors: We accept that an explicit description of the log schema is required to substantiate the claim. The revised Method section will include a dedicated subsection that enumerates the log schema, all fields, and the precise information content passed to the agent at each step. This addition will confirm that all hypotheses and interventions are generated exclusively from the logged data. revision: yes

Referee: [Evaluation Setup] Evaluation section: the transition from live Codex in-the-loop reasoning to the policy proxy used for all quantitative results is not described, including how hypotheses were consolidated into the proxy and whether the proxy faithfully reproduces the agent's live intervention behavior.

Authors: We agree that the proxy construction process and its fidelity validation must be documented. The revised Evaluation Setup will add a new subsection that (i) describes the consolidation procedure applied to live Codex hypotheses, (ii) specifies the criteria used to form the proxy, and (iii) reports quantitative checks confirming that the proxy reproduces the live agent's intervention decisions at high fidelity. These details will be added without changing any reported numerical results. revision: yes
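One simple form the promised fidelity check could take is an agreement rate: replay the decision points recorded during the live sessions and count how often the proxy chooses the same intervention. The sketch below assumes decisions can be represented as comparable labels; the function names and the action labels in the example are hypothetical, and the paper may measure fidelity differently.

```python
def agreement_rate(live_actions, proxy_actions):
    """Fraction of decision points where the proxy matches the live agent."""
    if len(live_actions) != len(proxy_actions):
        raise ValueError("both sequences must cover the same decision points")
    if not live_actions:
        raise ValueError("at least one decision point is required")
    matches = sum(1 for a, b in zip(live_actions, proxy_actions) if a == b)
    return matches / len(live_actions)


def disagreements(live_actions, proxy_actions):
    """Indices and action pairs where the proxy diverges from the live agent."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(live_actions, proxy_actions))
        if a != b
    ]


# Hypothetical labels for illustration only.
live = ["none", "widen_search", "none", "switch_operator"]
proxy = ["none", "widen_search", "none", "none"]
rate = agreement_rate(live, proxy)
```

Listing the disagreements alongside the rate matters because a proxy can agree on most steps simply by doing nothing when the live agent also did nothing; the informative cases are the ones where the live agent intervened.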

Circularity Check

0 steps flagged

No circularity: empirical comparisons only

The paper introduces RACL as a control layer and reports direct empirical results (improvements or ties in 21/21 and 18/21 cases, average deltas) against named baselines on vehicle routing instances. No equations, derivations, or first-principles claims appear in the provided text. The central claims are statistical outcomes of controlled experiments rather than predictions that reduce to fitted inputs or self-citations by construction. The method description (observing memory, formulating hypotheses) is procedural and does not contain self-definitional or load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that operational memory logs contain enough signal for a general reasoning agent to discover useful control rules.
