Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba
Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3
The pith
ECO combines batched preference optimization with a Mamba backbone to reach the strongest neural-solver performance on TSP and CVRP while cutting memory use and raising inference throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ECO trains a Mamba-based policy with a supervised warm-up followed by iterative batched direct preference optimization, using local search only to shape preference pairs during training. At inference, without any search procedure, the resulting policy produces TSP and CVRP solutions that surpass prior neural baselines in quality while requiring substantially less memory and delivering higher throughput.
What carries the argument
The batched direct preference optimization pipeline paired with a mixed Mamba encoder-decoder that decouples candidate generation from gradient updates and limits memory scaling on long sequences.
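The decoupling can be sketched in a few lines. This is our own illustrative rendering, not the paper's code: the function and field names (`dpo_pair_loss`, `make_pairs`, `cost`) are hypothetical, and the pairing rule (best candidate against every other) is one plausible choice, not necessarily the paper's.

```python
import math

# Sketch of the decoupled schedule: candidate tours are sampled in large
# batches from a frozen snapshot of the policy, turned into preference pairs
# offline, and only then consumed by several DPO gradient steps -- sampling
# and updating are not interleaved per step as in on-policy RL.

def dpo_pair_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (winner, loser) pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

def make_pairs(candidates):
    """Pair the batch's best tour (lowest cost) against every other candidate."""
    best = min(candidates, key=lambda c: c["cost"])
    return [(best, c) for c in candidates if c is not best]
```

With a zero margin the loss is exactly log 2, and it decreases monotonically as the policy separates winner from loser relative to the reference.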
If this is right
- Neural solvers can reach high solution quality at deployment without invoking local search or other post-processing steps.
- Memory usage grows more slowly with problem size because the Mamba backbone replaces quadratic attention.
- Throughput rises because trajectory sampling and gradient steps are no longer interleaved at every iteration.
- The same two-stage pipeline can be applied to other routing problems that admit fast local search oracles for training data.
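The memory bullet above rests on a simple activation count, which we can make concrete. This is a back-of-envelope comparison under our own illustrative defaults (`n_heads=8`, `d_state=16`, `d_model=128` are assumptions, not the paper's configuration):

```python
def attention_activations(n_tokens: int, n_heads: int = 8) -> int:
    """Elements in per-layer attention score matrices: quadratic in sequence length."""
    return n_heads * n_tokens * n_tokens

def ssm_activations(n_tokens: int, d_state: int = 16, d_model: int = 128) -> int:
    """Elements in per-layer SSM hidden states over a scan: linear in sequence length."""
    return n_tokens * d_state * d_model
```

Going from 1,000 to 10,000 nodes multiplies the attention count by 100 but the state-space count only by 10, which is the scaling gap the Mamba backbone exploits.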
Where Pith is reading between the lines
- Preference-based training may replace reinforcement-learning loops in other sequential optimization domains where expert solutions are available offline.
- Mamba-style state-space models could become the default backbone for any combinatorial task whose instances exceed a few hundred nodes.
- Decoupling sampling from updates might reduce the hardware demands of training large policies for structured prediction tasks beyond routing.
Load-bearing premise
That preference pairs built with local search only during training will yield a policy whose inference-time outputs remain competitive without any search on new problem instances.
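The premise can be made concrete with a toy version of the training-time pairing step. This is a hypothetical sketch, not the paper's implementation: we use first-improvement 2-opt as the local search (the paper's oracle may differ), and all names are ours.

```python
import math
import random

def tour_length(tour, pts):
    """Total Euclidean length of a closed tour over 2-D points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Reverse segments while any reversal strictly shortens the tour."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(cand, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = cand, True
    return tour

def preference_pair(sampled_tour, pts):
    """Training-time only: the locally improved tour is 'chosen', the raw sample 'rejected'."""
    return {"chosen": two_opt(sampled_tour, pts), "rejected": list(sampled_tour)}
```

The chosen tour is never longer than the sampled one, so each pair carries a nonnegative quality margin; the open question the premise poses is whether a policy trained on such pairs stays competitive when no `two_opt` call is available at inference.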
What would settle it
Running ECO without local search at inference on TSP instances larger than those seen in training and finding solution quality below that of search-augmented neural baselines would falsify the performance claim.
Original abstract
We study efficiency as a first-class objective in Neural Combinatorial Optimization (NCO) and present ECO, an efficient learning framework that combines batched preference optimization with a Mamba backbone. Instead of tightly interleaving every policy update with on-policy rollouts, ECO decouples trajectory generation from gradient updates through two stages: supervised warm-up on pre-computed solutions and iterative Direct Preference Optimization (DPO) on batched candidate sets generated by the current policy. We pair this learning pipeline with a mixed Mamba encoder-decoder that reduces memory growth on long sequences and improves hardware utilization. A local-search-guided bootstrapping strategy is further used during training to widen preference margins and stabilize iterative improvement. Importantly, local search is only used to construct stronger preference pairs during training and is never invoked at inference time. On TSP and CVRP, ECO achieves the strongest overall performance among the compared neural baselines while also delivering clear advantages in memory usage and throughput. We provide additional analysis on memory scaling, throughput, and the contribution of each design component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ECO, a framework for neural combinatorial optimization that pairs a Mamba encoder-decoder backbone with a two-stage batched preference optimization pipeline. Supervised warm-up on pre-computed solutions is followed by iterative Direct Preference Optimization (DPO) over batched candidate trajectories generated by the current policy; local search is used exclusively to construct stronger preference pairs during training and is never called at inference. On TSP and CVRP instances the method reports the strongest solution quality among neural baselines together with measurable reductions in memory footprint and gains in throughput, supported by scaling and ablation analyses.
Significance. If the reported performance and efficiency numbers hold under the stated experimental protocol, the work is significant for NCO because it treats memory and inference speed as first-class objectives rather than afterthoughts. The linear-time scaling of Mamba and the decoupling of on-policy rollouts via batched DPO directly address two recurring bottlenecks in the literature. Confining local search to the training phase yields a purely neural inference procedure, which is a practical advantage for deployment. The explicit scaling and component-ablation studies provide concrete evidence for the design decisions.
minor comments (3)
- [§4.2] §4.2, Table 2: the reported gap to the best neural baseline on CVRP-100 is given without standard deviation across the five random seeds; adding error bars would strengthen the claim of consistent superiority.
- [§3.3] §3.3: the notation for the preference margin in the DPO loss (Eq. 7) re-uses the symbol β that was earlier defined for the Mamba state-space parameter; a distinct symbol or explicit disambiguation would avoid reader confusion.
- [Figure 4] Figure 4: the memory-scaling plot would benefit from an additional curve showing peak GPU memory versus sequence length for a standard transformer baseline under identical batch size, to make the Mamba advantage visually quantitative.
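For context on the second comment: the standard DPO objective the paper adapts (Rafailov et al., 2023) already uses β as the preference temperature, which is presumably what Eq. 7 instantiates — hence the collision with the Mamba state-space parameter. In our rendering of the standard form:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred trajectories, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\sigma$ the logistic function; renaming either this β or the state-space one would resolve the referee's concern.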
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the recognition of its significance for neural combinatorial optimization, and the recommendation for minor revision. No specific major comments were provided in the report.
Circularity Check
No significant circularity detected
full rationale
The paper presents ECO as a two-stage pipeline (supervised warm-up followed by iterative batched DPO) paired with a Mamba encoder-decoder, where local search is explicitly restricted to constructing training preference pairs and never used at inference. No equations, derivations, or parameter-fitting steps are described that would make the claimed performance or efficiency advantages true by construction, for example by defining them through the same data used to validate them. The framework relies on standard DPO and Mamba components with empirical validation on TSP/CVRP instances; the central claims do not rest on load-bearing self-citation or on an ansatz smuggled in through the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) The Direct Preference Optimization loss and Mamba recurrence assumptions hold for the combinatorial trajectory setting.