Canonical reference

Title resolution pending

Proximal Policy Optimization Algorithms , author= · 2017

Canonical reference. 100% of citing Pith papers cite this work as background.

38 Pith papers citing it

Background 100% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-optimal performance on small scales and 8-30%+ improvements over heuristics onlarge

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.

Discovering Reinforcement Learning Interfaces with Large Language Models

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

LIMEN discovers effective RL interfaces by using LLMs to evolve observation and reward programs together from raw state, guided by policy training success, outperforming single-component optimization.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Discrete Tilt Matching

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied to LLaDA-8B-Instruct.

JAXenstein: Accelerated Benchmarking for First-Person Environments

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

JAXenstein ports the Wolfenstein 3D engine to JAX to create a fast, scalable benchmark for first-person visual RL that is several times quicker than existing vision-based alternatives.

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

GoLongRL releases a 23K-sample open long-context RL dataset spanning 9 tasks and introduces TMN-Reweight to improve multitask optimization, achieving performance comparable to much larger models under GRPO.

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.

GAGPO: Generalized Advantage Grouped Policy Optimization

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

GAGPO computes step-aligned temporal advantages from grouped rollout samples without a learned critic, enabling stable policy optimization in multi-turn agent environments.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.

Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

cs.GT · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.

Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs

cs.SI · 2026-05-10 · unverdicted · novelty 6.0

LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.

MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

MARLaaS enables concurrent RL fine-tuning across up to 32 tasks using LoRA adapters and a disaggregated asynchronous architecture, matching single-task performance while improving accelerator utilization by 4.3x and cutting end-to-end time by 85%.

Rotation-Preserving Supervised Fine-Tuning

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.

Vanishing L2 regularization for the softmax Multi Armed Bandit

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.

LLM Output Detectability and Task Performance Can be Jointly Optimized

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration cs.AI · 2026-05-07 · unverdicted · none · ref 31
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 75
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 112
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models cs.AI · 2026-05-18 · unreviewed · ref 53
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning cs.AI · 2026-05-13 · unreviewed · ref 88

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer