7 Pith papers cite this work, all from 2026. Polarity classification is still indexing, so all 7 verdicts are currently unverdicted. The 7 representative citing papers are listed below.
Citing papers explorer
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
  OP-GRPO is the first off-policy GRPO method for flow-matching models; it reuses trajectories via a replay buffer with importance-sampling corrections and matches on-policy performance using 34.2% of the training steps (see the importance-weighting sketch after this list).
- Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
  Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean (see the anchoring sketch after this list).
- SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
  SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance by up to 45.6% and cutting TFLOPs by up to 22x across 25 Minari tasks (see the stopping-rule sketch after this list).
- Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
  Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
- WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
  WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
- Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
  A fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
- Fisher Decorator: Refining Flow Policy via a Local Transport Map
  Fisher Decorator refines flow policies in offline RL via a local transport map and a Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and state-of-the-art benchmark results (see the quadratic-KL note after this list).
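As a rough picture of what "off-policy GRPO with a replay buffer and importance-sampling corrections" can look like, the sketch below combines a group-relative advantage with clipped per-trajectory importance ratios when reusing stale samples. All function names, the clipping range, and the loss form are hypothetical; this is a minimal sketch of the generic technique, not OP-GRPO's actual implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def importance_weights(logp_current, logp_behavior, clip=5.0):
    """Per-trajectory ratios pi_current / pi_behavior for replay-buffer samples.

    Clipping the ratio (a hypothetical choice here) bounds the variance
    introduced by very stale trajectories.
    """
    ratios = np.exp(np.asarray(logp_current) - np.asarray(logp_behavior))
    return np.clip(ratios, 1.0 / clip, clip)

def off_policy_group_loss(logp_current, logp_behavior, rewards):
    """Negative importance-weighted, advantage-weighted log-likelihood over a group."""
    adv = group_relative_advantages(rewards)
    w = importance_weights(logp_current, logp_behavior)
    return -float(np.mean(w * adv * np.asarray(logp_current)))
```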
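The median-of-three anchoring described for Anchor-TS is simple enough to state directly. The sketch below assumes Gaussian posteriors and a single arm; `anchored_score` and all parameter names are hypothetical, and the paper's actual posteriors and update rule may differ.

```python
import numpy as np

def anchored_score(rng, online_mean, online_var, hybrid_mean, hybrid_var,
                   online_sample_mean):
    """Median of (online posterior draw, hybrid posterior draw, online sample mean).

    The hybrid posterior uses offline + online data and can be biased by
    distribution shift; taking the median limits how far that bias can pull
    the arm's score away from the purely online evidence.
    """
    online_draw = rng.normal(online_mean, np.sqrt(online_var))
    hybrid_draw = rng.normal(hybrid_mean, np.sqrt(hybrid_var))
    return float(np.median([online_draw, hybrid_draw, online_sample_mean]))

rng = np.random.default_rng(0)
# Arm whose offline data is optimistic: the hybrid posterior is shifted upward.
print(anchored_score(rng, online_mean=0.40, online_var=0.05,
                     hybrid_mean=0.90, hybrid_var=0.01,
                     online_sample_mean=0.42))
```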
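The summary for SOPE only says that an OPE signal on a held-out validation split decides when the offline stabilization phase ends, so the stopping-rule sketch below is no more than a generic early-stopping loop built around that idea; `evaluate_ope`, `agent.update`, and `patience` are hypothetical stand-ins rather than the paper's interface.

```python
def run_offline_phase(agent, offline_batches, validation_split, evaluate_ope,
                      patience=3):
    """Stop the offline stabilization phase once the validation OPE signal plateaus.

    evaluate_ope(agent, validation_split) is assumed to return an off-policy
    estimate of the current actor's value on held-out prior data.
    Returns the number of offline update steps actually taken.
    """
    best_score, steps_without_gain = float("-inf"), 0
    for step, batch in enumerate(offline_batches, start=1):
        agent.update(batch)                       # one offline gradient step
        score = evaluate_ope(agent, validation_split)
        if score > best_score:
            best_score, steps_without_gain = score, 0
        else:
            steps_without_gain += 1
        if steps_without_gain >= patience:        # signal stopped improving
            return step                           # hand control back to online RL
    return len(offline_batches)
```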
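Finally, the quadratic-KL note: the "Fisher-matrix quadratic approximation of the KL constraint" cited for Fisher Decorator matches the standard second-order Taylor expansion of the KL divergence around the pretrained parameters. The expression below is that textbook expansion, not necessarily the exact constraint used in the paper.

```latex
D_{\mathrm{KL}}\!\left(\pi_{\theta_0}\,\|\,\pi_{\theta}\right)
  \;\approx\; \tfrac{1}{2}\,(\theta-\theta_0)^{\top} F(\theta_0)\,(\theta-\theta_0),
\qquad
F(\theta_0) \;=\; \mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[
  \nabla_{\theta}\log\pi_{\theta}(a)\,\nabla_{\theta}\log\pi_{\theta}(a)^{\top}
\right]\Big|_{\theta=\theta_0}.
```

Because the expansion is exact to second order at θ₀, constraining updates through this quadratic form is presumably what keeps the approximation error controllable near the pretrained optimum, as the summary claims.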