Probing RLVR training instability through the lens of objective-level hacking

2026 · cs.AI · arXiv 2602.01103

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.

representative citing papers

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

cs.LG · 2026-06-07 · unverdicted · novelty 4.0

Introduces Discrepancy-Constrained MDP (DCMDP) with Lagrangian relaxation to optimize LLM RL under train-inference discrepancy constraints, claiming performance gains on 8B and 30B models.
