Title resolution pending

Trust Region Policy Optimization · 2017

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

PWO is a trust-region optimizer for autoregressive NQS that improves stability over Adam and stochastic reconfiguration methods while scaling to 1.5B-parameter models on spin systems.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

cs.LG · 2026-05-21 · unverdicted · novelty 4.0

GPLD applies a row-wise Jacobian penalty to DreamerV3's posterior latent distribution, producing higher sample efficiency on DeepMind Control proprioceptive tasks.

citing papers explorer

One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective · cs.LG · 2026-07-02 · unverdicted · none · ref 19
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents · cs.CL · 2026-05-13 · unverdicted · none · ref 124 · 2 links
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex · cs.LG · 2026-05-07 · unverdicted · none · ref 36 · 2 links
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs · cs.LG · 2024-02-22 · conditional · none · ref 66
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics · cs.LG · 2026-05-21 · unverdicted · none · ref 19
