super hub Canonical reference

Group Sequence Policy Optimization

Bowen Yu, Chang Gao, Chujie Zheng, Mingze Li, Shixuan Liu, Xiong-Hui Chen · 2025 · cs.LG · arXiv 2507.18071

Canonical reference. 73% of citing Pith papers cite this work as background.

254 Pith papers citing it

Background 73% of classified citations

open full Pith review browse 254 citing papers more from Bowen Yu arXiv PDF

abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 31 method 10 baseline 2 dataset 2

citation-polarity summary

background 33 use method 7 baseline 2 use dataset 2 extend 1

claims ledger

abstract This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infras

authors

Bowen Yu Chang Gao Chujie Zheng Mingze Li Shixuan Liu Xiong-Hui Chen

co-cited works

representative citing papers

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

cs.AI · 2026-06-04 · conditional · novelty 8.0

Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

Predictable GRPO: A Closed-Form Model of Training Dynamics

cs.LG · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

GRPO updates reduce to a damped oscillator whose mass, damping, and stiffness are fixed by optimizer hyperparameters plus one measured curvature scale, subsuming single-exponential saturation while adding inertial slow-start and group-size predictions.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

TempAct introduces a planner-executor RL framework with hierarchical group exploration and rewards to improve temporal consistency in autoregressive video diffusion models.

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

Socratic-SWE distills agent solving traces into skills that generate and filter targeted repair tasks, yielding iterative gains on SWE-bench suites under fixed compute.

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Touch-R1: Reinforcing Touch Reasoning in MLLMs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

cs.AI · 2026-05-25 · conditional · novelty 7.0

CUA-Gym generates 32,112 verified RLVR tuples across 110 mock environments, enabling trained models to reach 62.1% and 72.6% on OSWorld-Verified while transferring to WebArena.

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

citing papers explorer

Showing 47 of 47 citing papers after filters.

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR cs.AI · 2026-06-04 · conditional · none · ref 13 · internal anchor
Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.
Tandem Reinforcement Learning with Verifiable Rewards cs.AI · 2026-06-26 · unverdicted · none · ref 24 · internal anchor
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning cs.AI · 2026-06-11 · unverdicted · none · ref 77 · internal anchor
ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.
Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces cs.AI · 2026-06-03 · unverdicted · none · ref 17 · internal anchor
Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor cs.AI · 2026-05-27 · unverdicted · none · ref 45 · internal anchor
Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents cs.AI · 2026-05-25 · conditional · none · ref 37 · internal anchor
CUA-Gym generates 32,112 verified RLVR tuples across 110 mock environments, enabling trained models to reach 62.1% and 72.6% on OSWorld-Verified while transferring to WebArena.
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation cs.AI · 2026-05-18 · unverdicted · none · ref 24 · internal anchor
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents cs.AI · 2026-05-13 · unverdicted · none · ref 59 · 2 links · internal anchor
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 38 · 2 links · internal anchor
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index cs.AI · 2026-06-30 · unverdicted · none · ref 34 · internal anchor
Introduces RSI metric and RSI-S filtering method for adaptive token selection in RLVR, reporting 2-3 point gains over GRPO on AIME/AMC benchmarks.
Omni-Perception Policy Optimization for Multimodal Emotion Reasoning cs.AI · 2026-06-24 · unverdicted · none · ref 34 · internal anchor
OPPO applies RL with an Omni-Perception Reward and masked-input KL loss to boost cue utilization and suppress hallucinations in emotion reasoning MLLMs, claiming SOTA results on MER-UniBench, MME-Emotion, and MEP-Bench.
IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing cs.AI · 2026-06-11 · unverdicted · none · ref 30 · 2 links · internal anchor
IterCAD is a multimodal agent framework using progressive SFT and geometry-aware RL for CAD tasks, with a new data pipeline, IterCAD-Bench, and CD-TR metric showing outperformance in executability and precision.
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization cs.AI · 2026-06-05 · unverdicted · none · ref 26 · internal anchor
PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning cs.AI · 2026-06-02 · unverdicted · none · ref 71 · internal anchor
EvoTrainer co-evolves LLM policies and training harnesses via empirical feedback to match or exceed human-engineered RL on math reasoning, code generation, and long-horizon software engineering.
Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling cs.AI · 2026-05-28 · unverdicted · none · ref 42 · internal anchor
GDCR assigns step-level rewards via distance to the answer node in a training-time ER graph and SAPO combines these with trajectory advantages for credit assignment in agentic search.
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning cs.AI · 2026-05-27 · unverdicted · none · ref 32 · internal anchor
TRACER combines a controller-regret layer using regret matching for speak/skip decisions with a generation-credit layer using GSPO rewards to enable learned collaboration in multi-LLM reasoning.
EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA cs.AI · 2026-05-27 · unverdicted · none · ref 20 · internal anchor
EAPO uses policy entropy ratio to adaptively weight positive samples in RLVR for open-ended QA, claiming better diversity and stability than fixed-weight baselines on medical datasets.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 126 · internal anchor
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 44 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents cs.AI · 2026-05-19 · unverdicted · none · ref 42 · internal anchor
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation cs.AI · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning cs.AI · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 19 · internal anchor
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI · 2026-05-07 · unverdicted · none · ref 106 · 3 links · internal anchor
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning cs.AI · 2026-05-02 · unverdicted · none · ref 21 · internal anchor
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning cs.AI · 2026-04-30 · unverdicted · none · ref 27 · 3 links · internal anchor
A 4B-parameter vision-language model trained on rubric-guided synthetic wafer defect data reaches 6.493 LLM-Judge score, nearly matching Gemini-3-Flash at 7.149 for on-premise industrial use.
MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation cs.AI · 2026-04-16 · unverdicted · none · ref 34 · internal anchor
MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models cs.AI · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale cs.AI · 2026-03-11 · unverdicted · none · ref 30 · internal anchor
LOM unifies ontology construction, semantic alignment, and deterministic reasoning in one architecture, reporting 88.8% accuracy on ontology completion and 94% on complex graph reasoning tasks.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 49 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models cs.AI · 2026-06-08 · unverdicted · none · ref 48 · internal anchor
AlloSpatial adds structured allocentric priors and a harness for tool-use and arbitration to improve spatial reasoning in foundation models, with 5-18% gains on VSI-Bench and MindCube in training-free settings and further gains after RL internalization.
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs cs.AI · 2026-05-27 · unverdicted · none · ref 52 · internal anchor
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments cs.AI · 2026-05-26 · unverdicted · none · ref 47 · internal anchor
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models cs.AI · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
Defines Entropy-Gradient Inversion as a negative entropy-gradient correlation fingerprinting LRM reasoning and proposes CorR-PO to embed it in RL regularization, claiming consistent outperformance on benchmarks.
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors cs.AI · 2026-05-09 · unverdicted · none · ref 51 · internal anchor
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Learning CLI Agents with Structured Action Credit under Selective Observation cs.AI · 2026-05-08 · unverdicted · none · ref 75 · internal anchor
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 16 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models cs.AI · 2026-04-18 · unverdicted · none · ref 11 · internal anchor
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
Mind DeepResearch Technical Report cs.AI · 2026-04-16 · unverdicted · none · ref 54 · internal anchor
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
StaRPO: Stability-Augmented Reinforcement Policy Optimization cs.AI · 2026-04-10 · unverdicted · none · ref 48 · internal anchor
StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.
Xiaomi-GUI-0 Technical Report cs.AI · 2026-06-30 · unverdicted · none · ref 52 · 2 links · internal anchor
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs cs.AI · 2026-06-16 · unverdicted · none · ref 34 · internal anchor
E³RL uses dynamic thresholds on epistemic entropy from autoregressive cross-entropy to enable erasable RL in LLM reasoning, reporting 5.349% and 6.514% gains on AIME for 4B and 8B models over prior SOTA.
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning cs.AI · 2026-05-01 · unverdicted · none · ref 60 · 2 links · internal anchor
AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning cs.AI · 2026-04-20 · conditional · none · ref 18 · internal anchor
Novice programmers completed more tasks with lower workload using GitHub Copilot versus a human partner, but reported significantly more positive and arousing emotions with the human teammate.
Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care cs.AI · 2026-06-08 · unverdicted · none · ref 11 · internal anchor
The paper describes Baichuan-M4, a coordinated medical agent system that reports leading scores across static knowledge, dynamic consultation, long-context memory, retrieval, OCR, and multimodal tasks with a 3.3% hallucination rate.
Rethinking Agentic Reinforcement Learning In Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 127 · 3 links · internal anchor
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Group Sequence Policy Optimization

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer