Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: CONDITIONAL (1)
representative citing papers
Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers
RL fine-tuning of Qwen2.5-Coder-14B with GRPO and feasibility-gated reward produces reusable constraint-aware Simulated Annealing solvers for Synergistic Dependency Selection, reducing gap to virtual best solver from 28.7% to 5.0% at 91x lower cost.
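
The summary names two ingredients, a feasibility-gated reward and GRPO, without giving their exact form. The sketch below is one plausible reading, not the paper's implementation: it assumes a maximization objective, a hypothetical gate that gives infeasible solutions zero reward, and a hypothetical normalization by a best-known value. The group-relative advantage follows the usual GRPO recipe of standardizing rewards within a group of samples drawn for the same prompt. The names `gated_reward` and `group_relative_advantages` are illustrative.

```python
from statistics import mean, pstdev


def gated_reward(feasible: bool, objective: float, best_known: float) -> float:
    """Hypothetical gate: infeasible solutions earn 0.0.

    Feasible solutions earn objective / best_known, clipped to [0, 1].
    Assumes a maximization problem with a positive best-known value.
    """
    if not feasible or best_known <= 0:
        return 0.0
    return max(0.0, min(1.0, objective / best_known))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solvers for one instance: two violate constraints.
samples = [(False, 9.0), (True, 5.0), (False, 12.0), (True, 10.0)]
rewards = [gated_reward(ok, obj, best_known=10.0) for ok, obj in samples]
advantages = group_relative_advantages(rewards)
```

Under this gate, a solver that scores well on the objective but breaks a constraint is ranked below every feasible solver in its group, which is the behavior a constraint-aware reward is meant to encourage.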