pith. sign in

arxiv: 2505.08735 · v1 · pith:II3TKI3Inew · submitted 2025-05-13 · 💻 cs.LG

Preference Optimization for Combinatorial Optimization Problems

classification 💻 cs.LG
keywords preferenceoptimizationcombinatorialpolicyproblemrewardsignalsexisting
0
0 comments X
read the original abstract

Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into the fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Place Guards by Reinforcement: A Geo-Free Neural Policy for the Vertex-Guard Art Gallery Problem

    cs.LG 2026-06 unverdicted novelty 7.0

    A reinforcement learning policy for the vertex-guard art gallery problem encodes sufficient geometric information in its encoder to allow a simple classifier to achieve high coverage feasibility out of distribution.

  2. Baseline-Free Policy Optimization for Neural Combinatorial Optimization

    cs.LG 2026-06 conditional novelty 6.0

    GRPO matches POMO solution quality within 2% on TSP/CVRP while avoiding REINFORCE training collapse on TSP-100 without needing a rollout baseline.