7 Pith papers cite this work, all from 2026. Polarity classification is still indexing, so all 7 verdicts are currently unverdicted. The 7 representative citing papers are listed below.
Citing papers explorer
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
  OP-GRPO is the first off-policy GRPO method for flow-matching models; it reuses trajectories via a replay buffer with importance-sampling corrections and matches on-policy performance using 34.2% of the training steps (see the importance-weighting sketch after this list).
- Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
  Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean (see the anchoring sketch after this list).
- SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
  SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance by up to 45.6% and cutting TFLOPs by up to 22x across 25 Minari tasks (see the stopping-rule sketch after this list).
- Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
  Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
- WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
  WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
- Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
  A fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
- Fisher Decorator: Refining Flow Policy via a Local Transport Map
  Fisher Decorator refines flow policies in offline RL via a local transport map and a Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and state-of-the-art benchmark results (see the quadratic-KL note after this list).
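As a rough picture of what "off-policy GRPO with a replay buffer and importance-sampling corrections" can look like, the sketch below combines a group-relative advantage with clipped per-trajectory importance ratios when reusing stale samples. All function names, the clipping range, and the loss form are hypothetical; this is a minimal sketch of the generic technique, not OP-GRPO's actual implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def importance_weights(logp_current, logp_behavior, clip=5.0):
    """Per-trajectory ratios pi_current / pi_behavior for replay-buffer samples.

    Clipping the ratio (a hypothetical choice here) bounds the variance
    introduced by very stale trajectories.
    """
    ratios = np.exp(np.asarray(logp_current) - np.asarray(logp_behavior))
    return np.clip(ratios, 1.0 / clip, clip)

def off_policy_group_loss(logp_current, logp_behavior, rewards):
    """Negative importance-weighted, advantage-weighted log-likelihood over a group."""
    adv = group_relative_advantages(rewards)
    w = importance_weights(logp_current, logp_behavior)
    return -float(np.mean(w * adv * np.asarray(logp_current)))
```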
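The median-of-three anchoring described for Anchor-TS is simple enough to state directly. The sketch below assumes Gaussian posteriors and a single arm; `anchored_score` and all parameter names are hypothetical, and the paper's actual posteriors and update rule may differ.

```python
import numpy as np

def anchored_score(rng, online_mean, online_var, hybrid_mean, hybrid_var,
                   online_sample_mean):
    """Median of (online posterior draw, hybrid posterior draw, online sample mean).

    The hybrid posterior uses offline + online data and can be biased by
    distribution shift; taking the median limits how far that bias can pull
    the arm's score away from the purely online evidence.
    """
    online_draw = rng.normal(online_mean, np.sqrt(online_var))
    hybrid_draw = rng.normal(hybrid_mean, np.sqrt(hybrid_var))
    return float(np.median([online_draw, hybrid_draw, online_sample_mean]))

rng = np.random.default_rng(0)
# Arm whose offline data is optimistic: the hybrid posterior is shifted upward.
print(anchored_score(rng, online_mean=0.40, online_var=0.05,
                     hybrid_mean=0.90, hybrid_var=0.01,
                     online_sample_mean=0.42))
```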
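The summary for SOPE only says that an OPE signal on a held-out validation split decides when the offline stabilization phase ends, so the stopping-rule sketch below is no more than a generic early-stopping loop built around that idea; `evaluate_ope`, `agent.update`, and `patience` are hypothetical stand-ins rather than the paper's interface.

```python
def run_offline_phase(agent, offline_batches, validation_split, evaluate_ope,
                      patience=3):
    """Stop the offline stabilization phase once the validation OPE signal plateaus.

    evaluate_ope(agent, validation_split) is assumed to return an off-policy
    estimate of the current actor's value on held-out prior data.
    Returns the number of offline update steps actually taken.
    """
    best_score, steps_without_gain = float("-inf"), 0
    for step, batch in enumerate(offline_batches, start=1):
        agent.update(batch)                       # one offline gradient step
        score = evaluate_ope(agent, validation_split)
        if score > best_score:
            best_score, steps_without_gain = score, 0
        else:
            steps_without_gain += 1
        if steps_without_gain >= patience:        # signal stopped improving
            return step                           # hand control back to online RL
    return len(offline_batches)
```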
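Finally, the quadratic-KL note: the "Fisher-matrix quadratic approximation of the KL constraint" cited for Fisher Decorator matches the standard second-order Taylor expansion of the KL divergence around the pretrained parameters. The expression below is that textbook expansion, not necessarily the exact constraint used in the paper.

```latex
D_{\mathrm{KL}}\!\left(\pi_{\theta_0}\,\|\,\pi_{\theta}\right)
  \;\approx\; \tfrac{1}{2}\,(\theta-\theta_0)^{\top} F(\theta_0)\,(\theta-\theta_0),
\qquad
F(\theta_0) \;=\; \mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[
  \nabla_{\theta}\log\pi_{\theta}(a)\,\nabla_{\theta}\log\pi_{\theta}(a)^{\top}
\right]\Big|_{\theta=\theta_0}.
```

Because the expansion is exact to second order at θ₀, constraining updates through this quadratic form is presumably what keeps the approximation error controllable near the pretrained optimum, as the summary claims.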