ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

2026 · cs.LG · arXiv 2601.21484

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design. The code is available at https://github.com/sheriyuo/ETS.
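The abstract's idea of a transition probability built from a reference policy and an energy term can be illustrated with a toy sketch. It is not the ETS implementation. It assumes the standard KL-regularized form, in which each candidate next token is reweighted by the expected value of exp(reward / beta) over completions drawn from the reference policy. The vocabulary, `ref_probs`, `reward`, and `BETA` below are hypothetical stand-ins chosen so the Monte Carlo estimate can be compared against exact enumeration.

```python
import math
import random

VOCAB = [0, 1, 2]   # hypothetical toy vocabulary
LENGTH = 3          # fixed sequence length for the toy problem
BETA = 2.0          # assumed KL-regularization temperature

def ref_probs(prefix):
    """Toy reference policy: uniform at the start, then favors repeating the last token."""
    if not prefix:
        return [1.0 / len(VOCAB)] * len(VOCAB)
    weights = [2.0 if tok == prefix[-1] else 1.0 for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def reward(seq):
    """Hypothetical reward: the sum of the tokens in the finished sequence."""
    return float(sum(seq))

def sample_completion(prefix, rng):
    """Roll the prefix out to full length using the reference policy."""
    seq = list(prefix)
    while len(seq) < LENGTH:
        seq.append(rng.choices(VOCAB, weights=ref_probs(seq))[0])
    return seq

def mc_energy(prefix, n_samples, rng):
    """Monte Carlo estimate of E[exp(reward / BETA)] over reference rollouts."""
    total = 0.0
    for _ in range(n_samples):
        total += math.exp(reward(sample_completion(prefix, rng)) / BETA)
    return total / n_samples

def exact_energy(prefix):
    """The same expectation computed by exhaustive enumeration (feasible only for toys)."""
    if len(prefix) == LENGTH:
        return math.exp(reward(prefix) / BETA)
    return sum(
        p * exact_energy(list(prefix) + [tok])
        for tok, p in zip(VOCAB, ref_probs(prefix))
    )

def guided_probs(prefix, energy_fn):
    """Next-token distribution: reference probability times energy, renormalized."""
    weights = [
        p * energy_fn(list(prefix) + [tok])
        for tok, p in zip(VOCAB, ref_probs(prefix))
    ]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    rng = random.Random(0)
    print("exact :", guided_probs([], exact_energy))
    print("approx:", guided_probs([], lambda p: mc_energy(p, 5000, rng)))
```

In this toy setting the guided distribution shifts mass toward tokens whose completions earn higher reward, and the Monte Carlo estimate approaches the enumerated value as the number of rollouts grows. The convergence rate and the importance sampling estimators mentioned in the abstract are not reproduced here.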

representative citing papers

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

HTAM builds a Hierarchical Transition Graph to organize coarse global directions and detailed local strategies for guiding LLM-based CUDA kernel optimization, improving results on KernelBench.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19

citing papers explorer

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

cs.CL · 2026-05-28 · unverdicted · none · ref 21 · internal anchor

HTAM builds a Hierarchical Transition Graph to organize coarse global directions and detailed local strategies for guiding LLM-based CUDA kernel optimization, improving results on KernelBench.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unreviewed · ref 20 · internal anchor
