Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems
OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, achieving near-optimal performance at small scales and improvements of 8-30% or more over heuristics at large scales.
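As a rough illustration of the scale trade-off named above, the toy below contrasts exact and greedy solutions to a one-shot robot-to-tote assignment: exhaustive search is optimal but only tractable at small scale, while a nearest-tote heuristic leaves the kind of gap a learned policy can close. The cost matrix and both solvers are invented for illustration; the paper's actual problem is sequential and order-coupled, not a one-shot assignment.

```python
from itertools import permutations

# Toy one-shot robot-to-tote assignment; cost[i][j] is the travel cost for
# robot i to fetch tote j. Invented numbers, not from the paper.
cost = [
    [4, 2, 8],
    [3, 7, 5],
    [6, 1, 9],
]

def optimal_assignment(cost):
    """Exhaustive search over permutations; tractable only at small scale."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return best, sum(cost[i][best[i]] for i in range(n))

def greedy_assignment(cost):
    """Nearest-tote heuristic, the kind of baseline learned policies beat."""
    taken, total, plan = set(), 0, []
    for row in cost:
        j = min((j for j in range(len(row)) if j not in taken),
                key=lambda j: row[j])
        taken.add(j)
        plan.append(j)
        total += row[j]
    return tuple(plan), total

print(optimal_assignment(cost))  # ((0, 2, 1), 10): the exact optimum
print(greedy_assignment(cost))   # ((1, 0, 2), 14): greedy pays ~40% more
```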
11 Pith papers cite this work.
citing papers explorer
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
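A minimal numpy sketch of the projection-residual step named above, under stated assumptions: each response is pooled into a fixed-size vector, the shared-semantics subspace is the span of the top right singular vectors of the positive-response matrix, and the rank k is a free choice. Where the residual enters the RL objective is not shown.

```python
import numpy as np

# Shapes and pooling are assumptions: each response is a pooled d-dim vector.
rng = np.random.default_rng(0)
H_pos = rng.normal(size=(32, 64))   # positive-response representations (n x d)
h_neg = rng.normal(size=64)         # one negative-response representation

# Shared-semantics subspace: top-k right singular vectors of the positives.
_, _, Vt = np.linalg.svd(H_pos, full_matrices=False)
k = 8                               # assumed subspace rank
V_k = Vt[:k]                        # (k x d)

# Projection residual: strip the component of the negative representation
# lying in the shared subspace, keeping the negative-specific signal.
h_res = h_neg - V_k.T @ (V_k @ h_neg)
print(np.linalg.norm(h_res) <= np.linalg.norm(h_neg))  # True by construction
```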
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight into module allocation and low-rank update directions; EffOPD exploits this to accelerate training by 3x via adaptive extrapolation, without extra modules or tuning.
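One plausible reading of "adaptive extrapolation" is sketched below: when consecutive low-rank updates point in agreeing directions, take a longer step along the current one. The alignment test, the step factor, and all tensors are assumptions for illustration, not EffOPD's actual schedule.

```python
import numpy as np

# All tensors are toys; the alignment test and step factor are assumptions.
rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))                     # weights being distilled
prev_update = 0.01 * rng.normal(size=(16, 16))    # last low-rank-style step
curr_update = 0.9 * prev_update + 0.001 * rng.normal(size=(16, 16))

def cosine(a, b):
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Extrapolate further along the current step when it agrees with the last one.
alpha = max(cosine(prev_update, curr_update), 0.0)
W += (1.0 + alpha) * curr_update
print(f"alignment={alpha:.2f}, step scaled by {1.0 + alpha:.2f}")
```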
-
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.
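CVaR is a standard convex risk measure of the kind the summary invokes; the sketch below applies it to hypothetical per-stratum preference losses to show how such an objective shifts weight onto the worst strata. The stratum losses and the choice of alpha are illustrative, not the paper's game formulation.

```python
import numpy as np

def cvar(losses, alpha=0.8):
    """Average of the worst (1 - alpha) fraction of losses: a convex risk measure."""
    losses = np.sort(np.asarray(losses))[::-1]          # worst first
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return losses[:k].mean()

stratum_losses = np.array([0.21, 0.35, 0.18, 0.92, 0.27])  # hypothetical strata
print("mean objective:", stratum_losses.mean())  # 0.386: dilutes the bad stratum
print("CVaR objective:", cvar(stratum_losses))   # 0.92: targets the worst stratum
```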
-
Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
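A toy localization pass in the spirit of the summary: score hidden units by their mean activation difference between stereotype-eliciting and neutral prompts, then edit the top scorers. The scoring rule, the zeroing edit, and all activations are synthetic stand-ins, not the paper's procedure.

```python
import numpy as np

# Synthetic activations; five "conflict" units are planted so the
# differential scoring can recover them.
rng = np.random.default_rng(2)
act_stereo = rng.normal(size=(100, 512))     # prompts x hidden units
act_neutral = rng.normal(size=(100, 512))
act_stereo[:, :5] += 2.0                     # planted conflict-monitoring units

# Score units by mean activation gap between the two prompt sets.
score = np.abs(act_stereo.mean(axis=0) - act_neutral.mean(axis=0))
candidates = np.argsort(score)[-5:]
print(sorted(candidates.tolist()))           # [0, 1, 2, 3, 4]: planted units found

def edit(hidden):
    """A minimal 'edit': suppress the identified units at inference time."""
    hidden = hidden.copy()
    hidden[..., candidates] = 0.0
    return hidden
```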
-
Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
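A toy version of the token-level divergence: the same model scores its own rollout twice, once with the privileged English context (teacher side) and once without (student side), and the per-token KL between the two distributions is the distillation signal. Vocabulary size, rollout length, and logits below are stand-ins for the actual setup.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
T, V = 6, 10                            # rollout length, toy vocabulary
logits_priv = rng.normal(size=(T, V))   # teacher pass: sees English context
logits_plain = rng.normal(size=(T, V))  # student pass: low-resource prompt only

# Per-token KL(teacher || student) on the model's own rollout positions.
p, q = softmax(logits_priv), softmax(logits_plain)
kl_per_token = (p * (np.log(p) - np.log(q))).sum(axis=-1)
print(kl_per_token.round(3), float(kl_per_token.mean()))  # mean is the loss
```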
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
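The parallel/recurrent equivalence the summary names can be seen on a diagonal linear recurrence, an illustrative special case rather than the paper's structured parameterization: the same states come out of a step-by-step loop (inference mode) and a closed-form cumulative computation over the whole sequence (training mode).

```python
import numpy as np

# One diagonal linear recurrence, evaluated two ways.
rng = np.random.default_rng(4)
T, d = 8, 4
a = rng.uniform(0.5, 0.95, size=d)            # per-channel decay
x = rng.normal(size=(T, d))                   # input sequence

# Recurrent form (inference): h_t = a * h_{t-1} + x_t, one token at a time.
h, rec = np.zeros(d), []
for t in range(T):
    h = a * h + x[t]
    rec.append(h.copy())
rec = np.stack(rec)

# Parallel form (training): h_t = sum_{s<=t} a^(t-s) x_s, via cumulative
# products and sums over the whole sequence at once.
A = np.cumprod(np.broadcast_to(a, (T, d)), axis=0)   # A[t] = a^(t+1)
par = A * np.cumsum(x / A, axis=0)

print(np.allclose(rec, par))                  # True: same map, two layouts
```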
-
MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service
MARLaaS enables concurrent RL fine-tuning across up to 32 tasks using LoRA adapters and a disaggregated asynchronous architecture, matching single-task performance while improving accelerator utilization by 4.3x and cutting end-to-end time by 85%.
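A loose sketch of the disaggregated asynchronous shape: rollout generators push task-tagged trajectories into a shared queue, and a learner drains it, updating only the matching task's adapter state. Task names, the queue, and the toy "update" are assumptions; the real service's scheduling and placement are not modeled.

```python
import queue
import threading
from collections import defaultdict

# Generation is decoupled from learning via a shared queue.
rollouts = queue.Queue()
adapter_steps = defaultdict(int)              # stand-in for per-task LoRA state

def generator(task, n=3):
    for i in range(n):
        rollouts.put((task, f"rollout-{i}"))  # task-tagged trajectory

def learner(total):
    for _ in range(total):
        task, _item = rollouts.get()
        adapter_steps[task] += 1              # step on this task's adapter only
        rollouts.task_done()

tasks = [f"task-{i}" for i in range(4)]       # the paper scales to 32 tasks
gens = [threading.Thread(target=generator, args=(t,)) for t in tasks]
learn = threading.Thread(target=learner, args=(len(tasks) * 3,))
for th in gens + [learn]:
    th.start()
for th in gens + [learn]:
    th.join()
print(dict(adapter_steps))                    # {'task-0': 3, 'task-1': 3, ...}
```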
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
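One illustrative proxy for "rotation in pretrained singular subspaces": compare the top-k left singular subspaces of the pretrained and fine-tuned weights via principal angles. Whether RPSFT penalizes this exact quantity, or a differentiable surrogate of it, is an assumption here; the weights and k are toys.

```python
import numpy as np

rng = np.random.default_rng(5)
W0 = rng.normal(size=(64, 32))                 # pretrained weight
W1 = W0 + 0.05 * rng.normal(size=(64, 32))     # fine-tuned weight (toy update)

k = 8
U0 = np.linalg.svd(W0, full_matrices=False)[0][:, :k]
U1 = np.linalg.svd(W1, full_matrices=False)[0][:, :k]

# Singular values of U0^T U1 are the cosines of the principal angles between
# the two k-dimensional subspaces; the penalty vanishes when they coincide.
cos2 = np.linalg.svd(U0.T @ U1, compute_uv=False) ** 2
rotation_penalty = 1.0 - cos2.mean()
print(f"rotation penalty: {rotation_penalty:.4f}")
```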
-
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.
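The ensemble-as-posterior mechanism can be shown on a toy bandit: act greedily under a uniformly sampled ensemble member whose estimates are kept diverse by bootstrapped updates, which approximates Thompson sampling. This is purely illustrative of the mechanism, not POETS's KL-regularized LLM instantiation or its regret analysis.

```python
import numpy as np

# Toy 3-armed bandit; ensemble members hold bootstrapped value estimates.
rng = np.random.default_rng(6)
true_means = np.array([0.2, 0.5, 0.8])
M, A = 5, len(true_means)                     # ensemble size, arms
est = rng.normal(0.5, 0.3, size=(M, A))      # deliberately diverse estimates
counts = np.ones((M, A))

for _ in range(500):
    m = rng.integers(M)                       # posterior sample ~ pick a member
    a = int(est[m].argmax())                  # act greedily under that member
    r = float(rng.random() < true_means[a])   # Bernoulli reward
    mask = rng.random(M) < 0.5                # bootstrap: update ~half the members
    counts[mask, a] += 1
    est[mask, a] += (r - est[mask, a]) / counts[mask, a]

print(est.mean(axis=0).round(2))              # arm 2 (true mean 0.8) dominates
```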
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
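For contrast with PPO's clipped-ratio and value-network machinery, a minimal REINFORCE-with-baseline loop at the sequence level is sketched below; the four canned responses and fixed reward-model scores are toy stand-ins, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(7)
logits = np.zeros(4)                          # sequence-level policy
rewards = np.array([0.1, 0.2, 0.9, 0.4])      # reward-model scores per response

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

baseline, lr = 0.0, 0.5
for _ in range(200):
    p = softmax(logits)
    a = rng.choice(4, p=p)                    # sample one full response
    r = rewards[a]
    baseline += 0.1 * (r - baseline)          # moving-average baseline, no critic
    grad = -p
    grad[a] += 1.0                            # d log p(a) / d logits
    logits += lr * (r - baseline) * grad      # plain REINFORCE update
print(softmax(logits).round(2))               # mass shifts onto response 2
```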