PWO is a trust-region optimizer for autoregressive NQS that improves stability over Adam and stochastic reconfiguration methods while scaling to 1.5B-parameter models on spin systems.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
GPLD applies a row-wise Jacobian penalty to DreamerV3's posterior latent distribution, producing higher sample efficiency on DeepMind Control proprioceptive tasks.
citing papers explorer
-
One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective
PWO is a trust-region optimizer for autoregressive NQS that improves stability over Adam and stochastic reconfiguration methods while scaling to 1.5B-parameter models on spin systems.
-
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
-
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics
GPLD applies a row-wise Jacobian penalty to DreamerV3's posterior latent distribution, producing higher sample efficiency on DeepMind Control proprioceptive tasks.