Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba
Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3
The pith
ECO combines batched preference optimization with a Mamba backbone to reach the strongest neural-solver performance on TSP and CVRP while cutting memory use and raising inference throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ECO trains a Mamba-based policy with a supervised warm-up followed by iterative batched direct preference optimization, using local search only to shape preference pairs during training. At inference, without any search procedure, the resulting policy produces TSP and CVRP solutions that surpass prior neural baselines in quality while requiring substantially less memory and delivering higher throughput.
What carries the argument
The batched direct preference optimization pipeline paired with a mixed Mamba encoder-decoder that decouples candidate generation from gradient updates and limits memory scaling on long sequences.
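The decoupling can be sketched in a few lines. This is our own illustrative rendering, not the paper's code: the function and field names (`dpo_pair_loss`, `make_pairs`, `cost`) are hypothetical, and the pairing rule (best candidate against every other) is one plausible choice, not necessarily the paper's.

```python
import math

# Sketch of the decoupled schedule: candidate tours are sampled in large
# batches from a frozen snapshot of the policy, turned into preference pairs
# offline, and only then consumed by several DPO gradient steps -- sampling
# and updating are not interleaved per step as in on-policy RL.

def dpo_pair_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (winner, loser) pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

def make_pairs(candidates):
    """Pair the batch's best tour (lowest cost) against every other candidate."""
    best = min(candidates, key=lambda c: c["cost"])
    return [(best, c) for c in candidates if c is not best]
```

With a zero margin the loss is exactly log 2, and it decreases monotonically as the policy separates winner from loser relative to the reference.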
If this is right
- Neural solvers can reach high solution quality at deployment without invoking local search or other post-processing steps.
- Memory usage grows more slowly with problem size because the Mamba backbone replaces quadratic attention.
- Throughput rises because trajectory sampling and gradient steps are no longer interleaved at every iteration.
- The same two-stage pipeline can be applied to other routing problems that admit fast local search oracles for training data.
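The memory bullet above rests on a simple activation count, which we can make concrete. This is a back-of-envelope comparison under our own illustrative defaults (`n_heads=8`, `d_state=16`, `d_model=128` are assumptions, not the paper's configuration):

```python
def attention_activations(n_tokens: int, n_heads: int = 8) -> int:
    """Elements in per-layer attention score matrices: quadratic in sequence length."""
    return n_heads * n_tokens * n_tokens

def ssm_activations(n_tokens: int, d_state: int = 16, d_model: int = 128) -> int:
    """Elements in per-layer SSM hidden states over a scan: linear in sequence length."""
    return n_tokens * d_state * d_model
```

Going from 1,000 to 10,000 nodes multiplies the attention count by 100 but the state-space count only by 10, which is the scaling gap the Mamba backbone exploits.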
Where Pith is reading between the lines
- Preference-based training may replace reinforcement-learning loops in other sequential optimization domains where expert solutions are available offline.
- Mamba-style state-space models could become the default backbone for any combinatorial task whose instances exceed a few hundred nodes.
- Decoupling sampling from updates might reduce the hardware demands of training large policies for structured prediction tasks beyond routing.
Load-bearing premise
That preference pairs built with local search only during training will yield a policy whose inference-time outputs remain competitive without any search on new problem instances.
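The premise can be made concrete with a toy version of the training-time pairing step. This is a hypothetical sketch, not the paper's implementation: we use first-improvement 2-opt as the local search (the paper's oracle may differ), and all names are ours.

```python
import math
import random

def tour_length(tour, pts):
    """Total Euclidean length of a closed tour over 2-D points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Reverse segments while any reversal strictly shortens the tour."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(cand, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = cand, True
    return tour

def preference_pair(sampled_tour, pts):
    """Training-time only: the locally improved tour is 'chosen', the raw sample 'rejected'."""
    return {"chosen": two_opt(sampled_tour, pts), "rejected": list(sampled_tour)}
```

The chosen tour is never longer than the sampled one, so each pair carries a nonnegative quality margin; the open question the premise poses is whether a policy trained on such pairs stays competitive when no `two_opt` call is available at inference.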
What would settle it
Running ECO without local search at inference on TSP instances larger than those seen in training and finding solution quality below that of search-augmented neural baselines would falsify the performance claim.
Original abstract
We study efficiency as a first-class objective in Neural Combinatorial Optimization (NCO) and present ECO, an efficient learning framework that combines batched preference optimization with a Mamba backbone. Instead of tightly interleaving every policy update with on-policy rollouts, ECO decouples trajectory generation from gradient updates through two stages: supervised warm-up on pre-computed solutions and iterative Direct Preference Optimization (DPO) on batched candidate sets generated by the current policy. We pair this learning pipeline with a mixed Mamba encoder-decoder that reduces memory growth on long sequences and improves hardware utilization. A local-search-guided bootstrapping strategy is further used during training to widen preference margins and stabilize iterative improvement. Importantly, local search is only used to construct stronger preference pairs during training and is never invoked at inference time. On TSP and CVRP, ECO achieves the strongest overall performance among the compared neural baselines while also delivering clear advantages in memory usage and throughput. We provide additional analysis on memory scaling, throughput, and the contribution of each design component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ECO, a framework for neural combinatorial optimization that pairs a Mamba encoder-decoder backbone with a two-stage batched preference optimization pipeline. Supervised warm-up on pre-computed solutions is followed by iterative Direct Preference Optimization (DPO) over batched candidate trajectories generated by the current policy; local search is used exclusively to construct stronger preference pairs during training and is never called at inference. On TSP and CVRP instances the method reports the strongest solution quality among neural baselines together with measurable reductions in memory footprint and gains in throughput, supported by scaling and ablation analyses.
Significance. If the reported performance and efficiency numbers hold under the stated experimental protocol, the work is significant for NCO because it treats memory and inference speed as first-class objectives rather than afterthoughts. The linear-time scaling of Mamba and the decoupling of on-policy rollouts via batched DPO directly address two recurring bottlenecks in the literature. Confining local search to the training phase yields a purely neural inference procedure, which is a practical advantage for deployment. The explicit scaling and component-ablation studies provide concrete evidence for the design decisions.
minor comments (3)
- [§4.2] §4.2, Table 2: the reported gap to the best neural baseline on CVRP-100 is given without standard deviation across the five random seeds; adding error bars would strengthen the claim of consistent superiority.
- [§3.3] §3.3: the notation for the preference margin in the DPO loss (Eq. 7) re-uses the symbol β that was earlier defined for the Mamba state-space parameter; a distinct symbol or explicit disambiguation would avoid reader confusion.
- [Figure 4] Figure 4: the memory-scaling plot would benefit from an additional curve showing peak GPU memory versus sequence length for a standard transformer baseline under identical batch size, to make the Mamba advantage visually quantitative.
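For context on the second comment: the standard DPO objective the paper adapts (Rafailov et al., 2023) already uses β as the preference temperature, which is presumably what Eq. 7 instantiates — hence the collision with the Mamba state-space parameter. In our rendering of the standard form:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred trajectories, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\sigma$ the logistic function; renaming either this β or the state-space one would resolve the referee's concern.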
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the recognition of its significance for neural combinatorial optimization, and the recommendation for minor revision. No specific major comments were provided in the report.
Circularity Check
No significant circularity detected
full rationale
The paper presents ECO as a two-stage pipeline (supervised warm-up followed by iterative batched DPO) paired with a Mamba encoder-decoder, where local search is explicitly restricted to constructing training preference pairs and never used at inference. No equations, derivations, or parameter-fitting steps are described that would make the claimed performance or efficiency advantages true by construction, for example by defining them through the same data used to validate them. The framework relies on standard DPO and Mamba components with empirical validation on TSP/CVRP instances; the central claims do not rest on load-bearing self-citation or on an ansatz smuggled in through the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) The Direct Preference Optimization loss and Mamba recurrence assumptions hold for the combinatorial trajectory setting.